Abstract

We prove an average-case depth hierarchy theorem for Boolean circuits over the standard
basis of AND, OR, and NOT gates. Our hierarchy theorem says that for every $d \ge 2$, there is
an explicit $n$-variable Boolean function $f$, computed by a linear-size depth-$d$ formula, such
that any depth-$(d-1)$ circuit that agrees with $f$ on a $(1/2 + o_n(1))$ fraction of all inputs
must have size $\exp(n^{\Omega(1/d)})$. This answers an open question posed by Hastad in his Ph.D. thesis
[Has86b].

Our average-case depth hierarchy theorem implies that the polynomial hierarchy is infinite
relative to a random oracle with probability 1, confirming a conjecture of Hastad [Has86a],
Cai [Cai86], and Babai [Bab87]. We also use our result to show that there is no “approximate
converse” to the results of Linial, Mansour, Nisan [LMN93] and Boppana [Bop97] on the total
influence of small-depth circuits, thus answering a question posed by O’Donnell [O’D07],
Kalai [Kal12], and Hatami [Hat14].

A key ingredient in our proof is a notion of random projections which generalize random
restrictions.

1 Introduction


The study of small-depth Boolean circuits is one of the great success stories of complexity theory.
The exponential lower bounds against constant-depth AND-OR-NOT circuits [Yao85, Has86a,
Raz87, Smo87] remain among our strongest unconditional lower bounds against concrete models of
computation, and the techniques developed to prove these results have led to significant advances in
computational learning theory [LMN93, Man95], pseudorandomness [Nis91, Baz09, Raz09, Bra10],
proof complexity [PBI93, Ajt94, KPW95], structural complexity [Yao85, Has86a, Cai86], and even
algorithm design [Wil14a, Wil14b, AWY15].

In addition to worst-case lower bounds against small-depth circuits, average-case lower bounds, 
or correlation bounds, have also received significant attention. As one recent example, Impagliazzo,
Matthews, and Paturi [IMP12] and Hastad [Has14] independently obtained optimal bounds on the
correlation of the parity function with small-depth circuits, capping off a long line of work on the 
problem [Ajt83, Yao85, Has86a, Cai86, Bab87, BIS12]. These results establish strong limits on 
the computational power of constant-depth circuits, showing that their agreement with the parity 
function can only be an exponentially small fraction better than that of a constant function. 

In this paper we will be concerned with average-case complexity within the class of small-depth 
circuits: our goal is to understand the computational power of depth-d circuits relative to those of 
strictly smaller depth. Our main result is an average-case depth hierarchy theorem for small-depth
circuits: 

Theorem 1. Let $2 \le d \le c\sqrt{\log n}/\log\log n$, where $c > 0$ is an absolute constant, and let
$\mathrm{Sipser}_d$ be the explicit $n$-variable read-once monotone depth-$d$ formula described in Section 6.
Then any circuit $C$ of depth at most $d-1$ and size at most $S = 2^{n^{1/(6(d-1))}}$ over $\{0,1\}^n$ agrees
with $\mathrm{Sipser}_d$ on at most $(\tfrac12 + n^{-\Omega(1/d)}) \cdot 2^n$ inputs.

(We actually prove two incomparable lower bounds, each of which implies Theorem 1 as a
special case. Roughly speaking, the first of these says that $\mathrm{Sipser}_d$ cannot be approximated by
size-$S$, depth-$d$ circuits which have significantly smaller bottom fan-in than $\mathrm{Sipser}_d$, and the
second says that $\mathrm{Sipser}_d$ cannot be approximated by size-$S$, depth-$d$ circuits with a different
top-level output gate than $\mathrm{Sipser}_d$.)

Theorem 1 is an average-case extension of the worst-case depth hierarchy theorems of Sipser, 
Yao, and Hastad [Sip83, Yao85, Has86a], and answers an open problem of Hastad [Has86a] (which 
also appears in [Has86b, Has89]). We discuss the background and context for Theorem 1 in
Section 1.1, and state our two main lower bounds more precisely in Section 1.2.


Applications. We give two applications of our main result, one in structural complexity and the 
other in the analysis of Boolean functions. First, via a classical connection between small-depth 
computation and the polynomial hierarchy [FSS81, Sip83], Theorem 1 implies that the polynomial 
hierarchy is infinite relative to a random oracle: 


Theorem 2. With probability 1, a random oracle $A$ satisfies $\Sigma_d^{P,A} \subsetneq \Sigma_{d+1}^{P,A}$
for all $d \in \mathbb{N}$.


This resolves a well-known conjecture in structural complexity, which first appeared in [Has86a,
Cai86, Bab87] and has subsequently been discussed in a wide range of surveys [Joh86, Hem94,
ST95, HRZ95, VW97, Aar], textbooks [DK00, HO02], and research papers [Has86b, Has89, Tar89,
For99, Aar10a]. (Indeed, the results of [Has86a, Cai86, Bab87], along with much of the pioneering
work on lower bounds against small-depth circuits in the 1980s, were largely motivated by the
aforementioned connection to the polynomial hierarchy.) See Section 2 for details.

Our second application is a strong negative answer to questions of Kalai, Hatami, and O’Donnell 
in the analysis of Boolean functions. Seeking an approximate converse to the fundamental results of 
Linial, Mansour, Nisan [LMN93] and Boppana [Bop97] on the total influence of small-depth circuits, 
Kalai asked whether every Boolean function with total influence polylog(n) can be approximated by 
a constant-depth circuit of quasipolynomial size [Kal10, Kal12, Hat14]. O’Donnell posed a variant
of the same question with a more specific quantitative bound on how the size of the approximating 
circuit depends on its influence and depth [O’D07]. As a consequence of Theorem 1 we obtain the 
following: 

Theorem 3. There are functions $d(n) = \omega_n(1)$ and $S(n) = \exp((\log n)^{\omega_n(1)})$ such that there is a
monotone $f : \{0,1\}^n \to \{0,1\}$ with total influence $\mathrm{Inf}(f) = O(\log n)$, but any circuit $C$ that has
depth $d(n)$ and agrees with $f$ on at least $(\tfrac12 + o_n(1)) \cdot 2^n$ inputs in $\{0,1\}^n$ must have size greater
than $S(n)$.

Theorem 3 significantly strengthens O’Donnell and Wimmer’s counterexample [OW07] to a 
conjecture of Benjamini, Kalai, and Schramm [BKS99], and shows that the total influence bound 
of [LMN93, Bop97] does not admit even a very weak approximate converse. See Section 3 for 
details. 

1.1 Previous work 

In this subsection we discuss previous work related to our average-case depth hierarchy theorem. 
We discuss the background and context for our applications, Theorems 2 and 3, in Sections 2 and 3 
respectively. 

Sipser was the first to prove a worst-case depth hierarchy theorem for small-depth circuits [Sip83].
He showed that for every $d \in \mathbb{N}$, there exists a Boolean function $F_d : \{0,1\}^n \to \{0,1\}$ such that $F_d$
is computed by a linear-size depth-$d$ circuit, but any depth-$(d-1)$ circuit computing $F_d$ has size
$\Omega(n^{\log^{(3d)} n})$, where $\log^{(i)} n$ denotes the $i$-th iterated logarithm. The family of functions $\{F_d\}_{d \in \mathbb{N}}$
witnessing this separation are depth-$d$ read-once monotone formulas with alternating layers of
AND and OR gates with fan-in $n^{1/d}$; these came to be known as the Sipser functions. Following
Sipser’s work, Yao claimed an improvement of Sipser’s lower bound to $\exp(n^{c_d})$ for some constant
$c_d > 0$ [Yao85]. Shortly thereafter Hastad proved a near-optimal separation for (a slight variant
of) the Sipser functions:

Theorem 4 (Depth hierarchy of small-depth circuits [Has86a]; see also [Has86b, Has89]). For
every $d \in \mathbb{N}$, there exists a Boolean function $F_d : \{0,1\}^n \to \{0,1\}$ such that $F_d$ is computed by a
linear-size depth-$d$ circuit, but any depth-$(d-1)$ circuit computing $F_d$ has size $\exp(n^{\Omega(1/d)})$.
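
To make the shape of such formulas concrete, the following Python sketch evaluates a read-once monotone formula with alternating AND and OR layers. It is only illustrative: the name `alternating_formula`, the uniform fan-in at every level, and the left-to-right assignment of variables to leaves are assumptions made here, and the Sipser functions used in this paper have level-dependent fan-ins that the sketch does not reproduce.

```python
from itertools import product


def alternating_formula(x, depth, fan_in, top_is_and=True):
    """Evaluate a read-once monotone formula with alternating AND/OR layers.

    Simplifying assumption (not from the paper): every gate has the same
    fan-in, so the formula reads fan_in ** depth distinct variables, each
    exactly once.
    """
    assert len(x) == fan_in ** depth
    if depth == 0:
        # A leaf is a single (unnegated) input variable.
        return bool(x[0])
    block = fan_in ** (depth - 1)
    # Each child reads its own disjoint block of variables and uses the
    # opposite gate type, which gives the alternation.
    children = (
        alternating_formula(x[i * block:(i + 1) * block], depth - 1, fan_in, not top_is_and)
        for i in range(fan_in)
    )
    return all(children) if top_is_and else any(children)


# Depth 2, fan-in 2, AND on top: (x1 OR x2) AND (x3 OR x4).
accepting = sum(alternating_formula(x, 2, 2) for x in product((0, 1), repeat=4))
```

With depth 2 and fan-in 2 the formula is (x1 OR x2) AND (x3 OR x4), which accepts 9 of the 16 inputs; swapping the top gate gives the opposite alternation pattern on the same variables.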

The parameters of Hastad’s theorem were subsequently refined by Cai, Chen, and Hastad [CCH98],
and Segerlind, Buss, and Impagliazzo [SBI04]. Prior to the work of Yao and Hastad, Klawe, Paul,
Pippenger, and Yannakakis [KPPY84] proved a depth hierarchy theorem for small-depth monotone
circuits, showing that for every $d \in \mathbb{N}$, depth-$(d-1)$ monotone circuits require size $\exp(\Omega(n^{1/(d-1)}))$
to compute the depth-$d$ Sipser function. Klawe et al. also gave an upper bound, showing that every
linear-size monotone formula (in particular, the depth-$d$ Sipser function for all $d \in \mathbb{N}$) can be
computed by a depth-$k$ monotone formula of size $\exp(O(k \cdot n^{1/(k-1)}))$ for all $k \in \mathbb{N}$.

To the best of our knowledge, the first progress towards an average-case depth hierarchy theorem
for small-depth circuits was made by O’Donnell and Wimmer [OW07]. They constructed a
linear-size depth-3 circuit $F$ and proved that any depth-2 circuit that approximates $F$ must have size
$2^{\Omega(n/\log n)}$.


Theorem 5 (Theorem 1.9 of [OW07]). For $w \in \mathbb{N}$ and $n := w 2^w$, let $\mathrm{Tribes} : \{0,1\}^n \to \{0,1\}$
be the function computed by a $2^w$-term read-once monotone DNF formula where every term has
width exactly $w$. Let $\mathrm{Tribes}^\dagger$ denote its Boolean dual, the function computed by a $2^w$-clause
read-once monotone CNF formula where every clause has width exactly $w$, and define the $2n$-variable
function $F : \{0,1\}^{2n} \to \{0,1\}$ as

$F(x) = \mathrm{Tribes}(x_1, \ldots, x_n) \vee \mathrm{Tribes}^\dagger(x_{n+1}, \ldots, x_{2n})$.

Then any depth-2 circuit $C$ on $2n$ variables that has size $2^{O(n/\log n)}$ agrees with $F$ on at most a
0.99-fraction of the $2^{2n}$ inputs. (Note that $F$ is computed by a linear-size depth-3 circuit.)
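
The definitions in Theorem 5 can be written out directly. The Python sketch below is a brute-force illustration for very small $w$; the function names (`tribes`, `tribes_dual`, `ow_function`) and the convention that consecutive blocks of $w$ variables form the terms are choices made here for illustration, not notation from [OW07].

```python
from itertools import product


def tribes(x, w):
    """Read-once monotone DNF: OR of 2**w disjoint AND-terms, each of width w."""
    assert len(x) == w * 2 ** w
    return any(all(x[t * w:(t + 1) * w]) for t in range(2 ** w))


def tribes_dual(x, w):
    """Boolean dual of tribes: AND of 2**w disjoint OR-clauses, each of width w."""
    assert len(x) == w * 2 ** w
    return all(any(x[c * w:(c + 1) * w]) for c in range(2 ** w))


def ow_function(x, w):
    """F(x) = Tribes(x_1..x_n) OR Tribes_dual(x_{n+1}..x_{2n}), with n = w * 2**w."""
    n = w * 2 ** w
    assert len(x) == 2 * n
    return tribes(x[:n], w) or tribes_dual(x[n:], w)


# Duality check for w = 2 (n = 8): dual(x) equals NOT tribes(complement of x).
dual_ok = all(
    tribes_dual(x, 2) == (not tribes(tuple(1 - b for b in x), 2))
    for x in product((0, 1), repeat=8)
)
```

The top-level OR over a DNF and a CNF is what makes $F$ a depth-3 circuit of linear size, as noted at the end of the theorem.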

Our Theorem 1 gives an analogous separation between depth-$d$ and depth-$(d+1)$ for all $d \ge 2$,
with $(1/2 - o_n(1))$-inapproximability rather than 0.01-inapproximability. The [OW07] size lower
bound of $2^{\Omega(n/\log n)}$ is much larger, in the case $d = 2$, than our $\exp(n^{\Omega(1/d)})$ size bound. However,
we recall that achieving an $\exp(\omega(n^{1/(d-1)}))$ lower bound against depth-$d$ circuits for an explicit
function, even for worst-case computation, is a well-known and major open problem in complexity
theory (see e.g. Chapter 11 of [Juk12] and [Val83, GW13, Vio13]). In particular, an extension of
the $2^{\Omega(n/\mathrm{polylog}(n))}$-type lower bound of [OW07] to depth 3, even for worst-case computation, would
constitute a significant breakthrough.

1.2 Our main lower bounds 

We close this section with precise statements of our two main lower bound results, a discussion of 
the (near-)optimality of our correlation bounds, and a very high-level overview of our techniques.

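Theorem 6 (First main lower bound). For $2 \le d \le c\sqrt{\log n}/\log\log n$, the $n$-variable $\mathrm{Sipser}_d$ function has
the following property: Any depth-$d$ circuit $C : \{0,1\}^n \to \{0,1\}$ of size at most $S = 2^{n^{1/(6(d-1))}}$ and
bottom fan-in $\frac{\log n}{10(d-1)}$ agrees with $\mathrm{Sipser}_d$ on at most $(\tfrac12 + n^{-\Omega(1/d)}) \cdot 2^n$ inputs.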

Theorem 7 (Second main lower bound). For $2 \le d \le c\sqrt{\log n}/\log\log n$, the $n$-variable $\mathrm{Sipser}_d$ function has
the following property: Any depth-$d$ circuit $C : \{0,1\}^n \to \{0,1\}$ of size at most $S = 2^{n^{1/(6(d-1))}}$ and
the opposite alternation pattern to $\mathrm{Sipser}_d$ (i.e. its top-level output gate is OR if that of $\mathrm{Sipser}_d$ is AND,
and vice versa) agrees with $\mathrm{Sipser}_d$ on at most $(\tfrac12 + n^{-\Omega(1/d)}) \cdot 2^n$ inputs.

Clearly both these results imply Theorem 1 as a special case, since any size-$S$ depth-$(d-1)$
circuit may be viewed as a size-$S$ depth-$d$ circuit satisfying the assumptions of Theorems 6 and 7.

(Near-)optimality of our correlation bounds. For constant $d$, our main result shows that
the depth-$d$ $\mathrm{Sipser}_d$ function has correlation at most $(1/2 + n^{-\Omega(1)})$ with any subexponential-size
circuit of depth $d-1$. Since $\mathrm{Sipser}_d$ is a monotone function, well-known results [BT96] imply that its
correlation with some input variable $x_i$ or one of the constant functions 0, 1 (trivial approximators
of depth at most one) must be at least $(1/2 + \Omega(1/n))$; thus significant improvements on our
correlation bound cannot be achieved for this (or for any monotone) function.

What about non-monotone functions? If $\{f_d\}_{d \ge 2}$ is any family of $n$-variable functions computed
by poly($n$)-size, depth-$d$ circuits, the “discriminator lemma” of Hajnal et al. [HMP+93] implies that
$f_d$ must have correlation at least $(1/2 + n^{-O(1)})$ with one of the depth-$(d-1)$ circuits feeding into its
topmost gate. Therefore a “$d$ versus $d-1$” depth hierarchy theorem for correlation $(1/2 + n^{-\omega(1)})$
does not hold.

Our techniques. Our approach is based on random projections, a generalization of random
restrictions. At a high level, we design a carefully (and adaptively) chosen sequence of random
projections, and argue that with high probability under this sequence of random projections, (i) any
circuit $C$ of the type specified in Theorem 6 or Theorem 7 “collapses,” while (ii) the $\mathrm{Sipser}_d$ function
“retains structure,” and (iii) moreover this happens in such a way as to imply that the circuit $C$
must have originally been a very poor approximator for $\mathrm{Sipser}_d$ (before the random projections).
Each of (i)-(iii) above requires significant work; see Section 4 for a much more detailed explanation
of our techniques (and of why previous approaches were unable to successfully establish the result).

2 Application #1: Random oracles separate the polynomial hierarchy

2.1 Background: PSPACE $\neq$ PH relative to a random oracle

The pioneering work on lower bounds against small-depth circuits in the 1980s was largely
motivated by a connection between small-depth computation and the polynomial hierarchy shown
by Furst, Saxe, and Sipser [FSS81]. They gave a super-polynomial size lower bound for
constant-depth circuits, proving that depth-$d$ circuits computing the $n$-variable parity function must have size
$\Omega(n^{\log^{(3d-6)} n})$, where $\log^{(i)} n$ denotes the $i$-th iterated logarithm. They also showed that an
improvement of this lower bound to super-quasipolynomial for constant-depth circuits (i.e. $\Omega_d(2^{(\log n)^k})$
for all constants $k$) would yield an oracle $A$ such that $\mathrm{PSPACE}^A \neq \mathrm{PH}^A$. Ajtai independently
proved a stronger lower bound of $n^{\Omega_d(\log n)}$ [Ajt83]; his motivation came from finite model theory.
Yao gave the first super-quasipolynomial lower bounds on the size of constant-depth circuits
computing the parity function [Yao85], and shortly after Hastad proved the optimal lower bound of
$\exp(\Omega(n^{1/(d-1)}))$ via his influential Switching Lemma [Has86a].

Yao’s relativized separation of PSPACE from PH was improved qualitatively by Cai, who showed 
that the separation holds even relative to a random oracle [Cai86]. Leveraging the connection made 
by [FSS81], Cai accomplished this by proving correlation bounds against constant-depth circuits, 
showing that constant-depth circuits of sub-exponential size agree with the parity function only on 
a $(1/2 + o_n(1))$ fraction of inputs. (Independent work of Babai [Bab87] gave a simpler proof of the
same relativized separation.) 

2.2 Background: The polynomial hierarchy is infinite relative to some oracle 

Together, these results paint a fairly complete picture of the status of the PSPACE versus PH
question in relativized worlds: not only does there exist an oracle $A$ such that $\mathrm{PSPACE}^A \neq \mathrm{PH}^A$,
this separation holds relative to almost all oracles. A natural next step is to seek analogous results
showing that the relativized polynomial hierarchy is infinite; we recall that the polynomial hierarchy
being infinite implies $\mathrm{PSPACE} \neq \mathrm{PH}$, and furthermore, this implication relativizes. We begin with
the following question, attributed to Albert Meyer in [BGS75]:

Meyer’s Question. Is there a relativized world within which the polynomial hierarchy is infinite?
Equivalently, does there exist an oracle $A$ such that $\Sigma_d^{P,A} \subsetneq \Sigma_{d+1}^{P,A}$ for all $d \in \mathbb{N}$?

Early work on Meyer’s question predates [FSS81]. It was first considered by Baker, Gill, and
Solovay in their paper introducing the notion of relativization [BGS75], in which they prove the
existence of an oracle $A$ such that $\mathrm{P}^A \neq \mathrm{NP}^A \neq \mathrm{coNP}^A$, answering Meyer’s question in the
affirmative for $d \in \{0,1\}$. Subsequent work of Baker and Selman proved the $d = 2$ case [BS79].
Following [FSS81], Sipser noted the analogous connection between Meyer’s question and circuit
lower bounds [Sip83]: to answer Meyer’s question in the affirmative, it suffices to exhibit, for
every constant $d \in \mathbb{N}$, a Boolean function $F_d$ computable by a depth-$d$ $\mathrm{AC}^0$ circuit such that any
depth-$(d-1)$ circuit computing $F_d$ requires super-quasipolynomial size. (This is a significantly
more delicate task than proving super-quasipolynomial size lower bounds for the parity function;
see Section 4 for a detailed discussion.) Sipser also constructed a family of Boolean functions for
which he proved an $n$ versus $\Omega(n^{\log^{(3d)} n})$ separation; these came to be known as the Sipser
functions, and they play the same central role in Meyer’s question as the parity function does in the
relativized PSPACE versus PH problem.

As discussed in the introduction (see Theorem 4), Hastad gave the first proof of a near-optimal $n$
versus $\exp(n^{\Omega(1/d)})$ separation for the Sipser functions [Has86a], obtaining a strong depth hierarchy
theorem for small-depth circuits and answering Meyer’s question in the affirmative for all $d \in \mathbb{N}$.

2.3 This work: The polynomial hierarchy is infinite relative to a random oracle 

Given Hastad’s result, a natural goal is to complete our understanding of Meyer’s question by 
showing that the polynomial hierarchy is not just infinite with respect to some oracle, but in fact 
with respect to almost all oracles. Indeed, in [Has86a, Has86b, Has89], Hastad poses the problem
of extending his result to show this as an open question: 

Question 1 (Meyer’s Question for Random Oracles [Has86a, Has86b, Has89]). Is the polynomial
hierarchy infinite relative to a random oracle? Equivalently, does a random oracle $A$ satisfy
$\Sigma_d^{P,A} \subsetneq \Sigma_{d+1}^{P,A}$ for all $d \in \mathbb{N}$?

Question 1 also appears as the main open problem in [Cai86, Bab87]; as mentioned above, an
affirmative answer to Question 1 would imply Cai and Babai’s result showing that $\mathrm{PSPACE}^A \neq \mathrm{PH}^A$
relative to a random oracle $A$. Further motivation for studying Question 1 comes from a surprising
result of Book, who proved that the unrelativized polynomial hierarchy collapses if it collapses
relative to a random oracle [Boo94]. Over the years Question 1 has been discussed in a wide range
of surveys [Joh86, Hem94, ST95, HRZ95, VW97, Aar], textbooks [DK00, HO02], and research
papers [Has86b, Has89, Tar89, For99, Aar10a].

Our work. As a corollary of our main result (Theorem 1), an average-case depth hierarchy
theorem for small-depth circuits, we answer Question 1 in the affirmative for all $d \in \mathbb{N}$:

Theorem 2. The polynomial hierarchy is infinite relative to a random oracle: with probability 1,
a random oracle $A$ satisfies $\Sigma_d^{P,A} \subsetneq \Sigma_{d+1}^{P,A}$ for all $d \in \mathbb{N}$.

Prior to our work, the $d \in \{0,1\}$ cases were proved by Bennett and Gill in their paper initiating
the study of random oracles [BG81]. Motivated by the problem of obtaining relativized separations
in quantum structural complexity, Aaronson recently showed that a random oracle $A$ separates
$\Pi_2^P$ from $\mathrm{P}^{\mathrm{NP}}$ [Aar10b, Aar10a]; he conjectures in [Aar10a] that his techniques can be extended
to resolve the $d = 2$ case of Theorem 2. We observe that O’Donnell and Wimmer’s techniques
(Theorem 5 in our introduction) can be used to prove the $d = 2$ case [OW07], though the authors
of [OW07] do not discuss this connection to the relativized polynomial hierarchy in their paper.



                                                         $\mathrm{PSPACE}^A \neq \mathrm{PH}^A$    $\Sigma_d^{P,A} \subsetneq \Sigma_{d+1}^{P,A}$ for all $d \in \mathbb{N}$
Connection to lower bounds for constant-depth circuits   [FSS81]                                   [Sip83]
Hard function(s)                                         Parity                                    Sipser functions
Relative to some oracle $A$                              [Yao85, Has86a]                           [Yao85, Has86a]
Relative to random oracle $A$                            [Cai86, Bab87]                            This work

Table 1: Previous work and our result on the relativized polynomial hierarchy

We refer the reader to Chapter 7 of Hastad’s thesis [Has86b] for a detailed exposition (and
complete proofs) of the aforementioned connections between small-depth circuits and the polynomial
hierarchy (in particular, for the proof of how Theorem 2 follows from Theorem 1).

3 Application #2: No approximate converse to Boppana-Linial-Mansour-Nisan

The famous result of Linial, Mansour, and Nisan gives strong bounds on Fourier concentration
of small-depth circuits [LMN93]. As a corollary, they derive an upper bound on the total
influence of small-depth circuits, showing that depth-$d$ size-$S$ circuits have total influence
$(O(\log S))^d$. (We remind the reader that the total influence of an $n$-variable Boolean function
$f$ is $\mathrm{Inf}(f) := \sum_{i=1}^n \mathrm{Inf}_i(f)$, where $\mathrm{Inf}_i(f)$ is the probability that flipping coordinate $i \in [n]$ of
a uniform random input from $\{0,1\}^n$ causes the value of $f$ to change.) This was subsequently
sharpened by Boppana via a simpler and more direct proof [Bop97]:

Theorem 8 (Boppana, Linial-Mansour-Nisan). Let $f : \{0,1\}^n \to \{0,1\}$ be computed by a size-$S$
depth-$d$ circuit. Then $\mathrm{Inf}(f) = (O(\log S))^{d-1}$.

(We note that Boppana’s bound is asymptotically tight by considering the parity function.) 
Several researchers have asked whether an approximate converse of some sort holds for Theorem 8:

If $f : \{0,1\}^n \to \{0,1\}$ has low total influence, is it the case that $f$ can be approximated
to high accuracy by a small constant-depth circuit?

A result of this flavor, taken together with Theorem 8, would yield an elegant characterization of
Boolean functions with low total influence. In this section we formulate a very weak approximate 
converse to Theorem 8 and show, as a consequence of our main result (Theorem 1), that even this 
weak converse does not hold. 
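
As a small sanity check on the definition of total influence recalled at the start of this section, the following Python sketch computes $\mathrm{Inf}(f)$ by brute force over all $2^n$ inputs. The helper names (`total_influence`, `parity`, `majority3`) are illustrative choices made here, and the enumeration is only feasible for small $n$.

```python
from itertools import product


def total_influence(f, n):
    """Inf(f) = sum over i of Pr_x[f(x) != f(x with bit i flipped)], x uniform in {0,1}^n."""
    flips = 0
    for x in product((0, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            # y is x with coordinate i flipped.
            y = x[:i] + (1 - x[i],) + x[i + 1:]
            flips += fx != f(y)
    # Each coordinate contributes (number of sensitive inputs) / 2**n.
    return flips / 2 ** n


def parity(x):
    return sum(x) % 2


def majority3(x):
    return int(sum(x) >= 2)


inf_parity = total_influence(parity, 4)     # every flip changes parity
inf_majority = total_influence(majority3, 3)
```

Parity on $n$ bits has total influence exactly $n$, the largest possible value, while majority on 3 bits has total influence 1.5: each coordinate matters exactly when the other two disagree.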


3.1 Background: BKS conjecture and O’Donnell-Wimmer’s counterexample

An approximate converse to Theorem 8 was first conjectured by Benjamini, Kalai, and Schramm, 
with a very specific quantitative bound on how the size of the approximating circuit depends on 
its influence and depth [BKS99] (the conjecture also appears in the surveys [Kal00, KS05]). They
posed the following: 

Benjamini-Kalai-Schramm (BKS) Conjecture. For every $\epsilon > 0$ there is a constant
$K = K(\epsilon)$ such that the following holds: Every monotone $f : \{0,1\}^n \to \{0,1\}$ can be $\epsilon$-approximated by
a depth-$d$ circuit of size at most

$\exp((K \cdot \mathrm{Inf}(f))^{1/(d-1)})$

for some $d \ge 2$.

(We associate a circuit with the Boolean function that it computes, and we say that a circuit
$\epsilon$-approximates a Boolean function $f$ if it agrees with $f$ on all but an $\epsilon$-fraction of all inputs.) If
true, the BKS conjecture would give a quantitatively strong converse to Theorem 8 for monotone
functions.^1 In addition, it would have important implications for the study of threshold phenomena
in Erdős-Rényi random graphs, which is the context in which Benjamini, Kalai, and Schramm made
their conjecture; we refer the reader to [BKS99] and Section 1.4 of [OW07] for a detailed discussion
of this connection. However, the BKS conjecture was disproved by O’Donnell and Wimmer [OW07].
Their result (Theorem 5 in our introduction) disproves the case $d = 2$ of the BKS conjecture, and
the case $d > 2$ is disproved by an easy argument which [OW07] give.

3.2 This work: Disproving a weak variant of the BKS conjecture 

A significantly weaker variant of the BKS conjecture is the following: 

Conjecture 1. For every $\epsilon > 0$ there is a $d = d(\epsilon)$ and $K_1 = K_1(\epsilon)$, $K_2 = K_2(\epsilon)$ such that the
following holds: Every monotone $f : \{0,1\}^n \to \{0,1\}$ can be $\epsilon$-approximated by a depth-$d$ circuit of
size at most

$\exp((K_1 \cdot \mathrm{Inf}(f))^{K_2})$.

The [OW07] counterexample to the BKS conjecture does not disprove Conjecture 1; indeed,
the function $f$ that [OW07] construct and analyze is computed by a depth-3 circuit of size $O(n)$.^2
Observe that Conjecture 1, if true, would yield the following rather appealing consequence: every
monotone $f : \{0,1\}^n \to \{0,1\}$ with total influence at most polylog($n$) can be approximated to any
constant accuracy by a quasipolynomial-size, constant-depth circuit (where both the constant in
the quasipolynomial size bound and the constant depth of the circuit may depend on the desired
accuracy).

^1 We remark that although the BKS conjecture was stated for monotone Boolean functions, it seems that (a priori)
it could have been true for all Boolean functions: prior to [OW07], we are not aware of any counterexample to the
BKS conjecture even if $f$ is allowed to be non-monotone.

^2 As with the BKS conjecture, prior to our work we are not aware of any counterexample to Conjecture 1 even if
$f$ is allowed to be non-monotone.

Following O’Donnell and Wimmer’s disproof of the BKS conjecture, several researchers have
posed questions similar in spirit to Conjecture 1. O’Donnell asked if the BKS conjecture is true if
the bound on the size of the approximating circuit is allowed to be $\exp((K \cdot \mathrm{Inf}(f))^{1/d})$ instead
of $\exp((K \cdot \mathrm{Inf}(f))^{1/(d-1)})$ [O’D07]. This is a weaker statement than the original BKS conjecture
(in particular, it is not ruled out by the counterexample of [OW07]), but still significantly
stronger than Conjecture 1. Subsequently Kalai asked if Boolean functions with total influence
polylog($n$) (resp. $O(\log n)$) can be approximated by constant-depth circuits of quasipolynomial size
(resp. $\mathrm{AC}^0$) [Kal12] (see also [Kal10] where he states a qualitative version). Kalai’s question is a
variant of Conjecture 1 in which $f$ is allowed to be non-monotone, but $\mathrm{Inf}(f)$ is only allowed to
be polylog($n$); furthermore, $K_2(\epsilon)$ is only allowed to be 1 if $\mathrm{Inf}(f) = O(\log n)$. Finally, H. Hatami
recently restated the $\mathrm{Inf}(f) = O(\log n)$ case of Kalai’s question:

Problem 4.6.3 of [Hat14]. Is it the case that for every $\epsilon, C > 0$, there are constants $d, k$ such
that for every $f : \{0,1\}^n \to \{0,1\}$ with $\mathrm{Inf}(f) \le C \log n$, there is a size-$n^k$, depth-$d$ circuit which
$\epsilon$-approximates $f$?

Our work. As a corollary of our main result (Theorem 1), we show that Conjecture 1 is false
even for (suitable choices of) $\epsilon = \tfrac12 - o_n(1)$. Our counterexample also provides a strong negative
answer to O’Donnell’s and Kalai-Hatami’s versions of Conjecture 1. We prove the following:

Theorem 3. Conjecture 1 is false. More precisely, there is a monotone $f : \{0,1\}^n \to \{0,1\}$ and
a $\delta(n) = o_n(1)$ such that $\mathrm{Inf}(f) = O(\log n)$ but any circuit of depth $d(n) = \sqrt{\log\log n}$ that agrees
with $f$ on a $(\tfrac12 + \delta(n))$ fraction of all inputs must have size at least $S(n) = 2^{2^{\tilde{\Omega}(2^{\sqrt{\log\log n}})}}$.

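Proof of Theorem 3 assuming Theorem 1. Consider the monotone Boolean function
$f : \{0,1\}^n \to \{0,1\}$ corresponding to $\mathrm{Sipser}_d$ of Theorem 1 defined over the first $m = 2^{2^{\sqrt{\log\log n}}}$ variables, and
of depth $d = \lfloor \log\log m \rfloor + 1 = \lfloor \sqrt{\log\log n} \rfloor + 1$. By Boppana’s theorem (Theorem 8), we have that

$\mathrm{Inf}(f) = O(\log m)^{d-1} = O(2^{\sqrt{\log\log n}})^{\sqrt{\log\log n}} = O(\log n)$.

On the other hand, our main theorem (Theorem 1) implies that even circuits of depth
$d - 1 = \lfloor \sqrt{\log\log n} \rfloor$ which agree with $f$ on a $(\tfrac12 + \delta(n))$ fraction of all inputs, where
$\delta(n) = 2^{-\Omega(2^{\sqrt{\log\log n}} / \lfloor \sqrt{\log\log n} \rfloor)}$, must have size at least

$S(n) = 2^{m^{1/(6(d-1))}} = 2^{2^{\Omega(2^{\sqrt{\log\log n}} / \sqrt{\log\log n})}} = 2^{2^{\tilde{\Omega}(2^{\sqrt{\log\log n}})}}$. □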


4 Our techniques 

The method of random restrictions dates back to Subbotovskaya [Sub61] and continues to be an 
indispensable technique in circuit complexity. Focusing only on small-depth circuits, we mention 
that the random restriction method is the common essential ingredient underlying the landmark 
lower bounds discussed in the previous sections [FSS81, Ajt83, Sip83, Yao85, Has86a, Cai86, Bab87,
IMP12, Has14].

We begin in Section 4.1 by describing the general framework for proving worst- and average-case
lower bounds against small-depth circuits via the random restriction method. Within this
framework, we sketch the now-standard proof of correlation bounds for the parity function based
on Hastad’s Switching Lemma. We also recall why the lemma is not well-suited for proving a
depth hierarchy theorem for small-depth circuits, hence necessitating the “blockwise variant” of
the lemma that Hastad developed and applied to prove his (worst-case) depth hierarchy theorem.
In Section 4.2 we highlight the difficulties that arise in extending Hastad’s depth hierarchy theorem
to the average case, and how our techniques (specifically, the notion of random projections)
allow us to overcome these difficulties.

4.1 Background: Lower bounds via random restrictions 

Suppose we would like to show that a target function $f : \{0,1\}^n \to \{0,1\}$ has small correlation
with any size-$S$ depth-$d$ approximating circuit $C$ under the uniform distribution $\mathcal{U}$ over $\{0,1\}^n$.
A standard approach is to construct a series of random restrictions $\{\mathcal{R}_k\}_{k \in \{2,\ldots,d\}}$ satisfying three
properties:

- Property 1: Approximator $C$ simplifies. The randomly-restricted circuit $C \upharpoonright \rho^{(d)} \cdots \rho^{(2)}$,
where $\rho^{(k)} \leftarrow \mathcal{R}_k$ for $2 \le k \le d$, should “collapse to a simple function” with high probability.
This is typically shown via iterative applications of an appropriate “Switching Lemma for the
$\mathcal{R}_k$’s”, which shows that each random restriction $\rho^{(k)}$ decreases the depth of the circuit
$C \upharpoonright \rho^{(d)} \cdots \rho^{(k+1)}$ by one with high probability. The upshot is that while $C$ is a depth-$d$ size-$S$
circuit, $C \upharpoonright \rho^{(d)} \cdots \rho^{(2)}$ will be a small-depth decision tree, a “simple function”, with high
probability.

- Property 2: Target $f$ retains structure. In contrast with the approximating circuit, the
target function $f$ should (roughly speaking) be resilient against the random restrictions
$\rho^{(k)} \leftarrow \mathcal{R}_k$. While the precise meaning of “resilient” depends on the specific application, the key property
we need is that $f \upharpoonright \rho^{(d)} \cdots \rho^{(2)}$ will with high probability be a “well-structured” function that
is uncorrelated with any small-depth decision tree.

Together, these two properties imply that random restrictions of $f$ and $C$ are uncorrelated with
high probability. Note that this already yields worst-case lower bounds, showing that
$f : \{0,1\}^n \to \{0,1\}$ cannot be computed exactly by $C$. To obtain correlation bounds, we need to translate such
a statement into the fact that $f$ and $C$ themselves are uncorrelated. For this we need the third key
property of the random restrictions:

- Property 3: Composition of $\mathcal{R}_k$’s completes to $\mathcal{U}$. Evaluating a Boolean function
$h : \{0,1\}^n \to \{0,1\}$ on a random input $X \leftarrow \mathcal{U}$ is equivalent to first applying random restrictions
$\rho^{(d)}, \ldots, \rho^{(2)}$ to $h$, and then evaluating the randomly-restricted function $h \upharpoonright \rho^{(d)} \cdots \rho^{(2)}$ on
$X' \leftarrow \mathcal{U}$.

Correlation bounds for parity. For uniform-distribution correlation bounds against
constant-depth circuits computing the parity function, the random restrictions are all drawn from $\mathcal{R}(p)$, the
“standard” random restriction which independently sets each free variable to 0 with probability
$\tfrac12(1-p)$, to 1 with probability $\tfrac12(1-p)$, and keeps it free with probability $p$. The main technical
challenge arises in proving that Property 1 holds (this is precisely Hastad’s Switching Lemma),
whereas Properties 2 and 3 are straightforward to show. For the second property, we note that

$\mathrm{Parity}_n \upharpoonright \rho \equiv \pm \mathrm{Parity}(\rho^{-1}(*))$ for all restrictions $\rho \in \{0,1,*\}^n$,

and so $\mathrm{Parity}_n \upharpoonright \rho^{(d)} \cdots \rho^{(2)}$ computes the parity of a random subset $S \subseteq [n]$ of coordinates (or its
negation). With an appropriate choice of the $*$-probability $p$ we have that $|S|$ is large with high
probability; recall that $\pm\mathrm{Parity}_k$ (the $k$-variable parity function or its negation) has zero correlation
with any decision tree of depth at most $k-1$. For the third property, we note that for all values
of $p \in (0,1)$, a random restriction $\rho \leftarrow \mathcal{R}(p)$ specifies a uniform random subcube of $\{0,1\}^n$ (of
dimension $|\rho^{-1}(*)|$). Therefore, the third property is a consequence of the simple fact that a uniform
random point within a uniform random subcube is itself a uniform random point from $\{0,1\}^n$.
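
The two facts used above for parity can be checked mechanically on small instances. The Python sketch below is an illustration under assumptions made here: restrictions are represented as tuples over 0, 1 and the string "*", and the helper names (`sample_restriction`, `restrict`, `completed_input_distribution`) are hypothetical. It samples a restriction from $\mathcal{R}(p)$, verifies that restricted parity is parity on the free coordinates up to negation, and computes exactly the law of the input obtained by drawing a restriction and then filling its free coordinates uniformly.

```python
import random
from fractions import Fraction
from itertools import product

STAR = "*"


def sample_restriction(n, p, rng):
    """Draw rho from R(p): each coordinate is * with prob. p, else a uniform bit."""
    return tuple(STAR if rng.random() < p else rng.randint(0, 1) for _ in range(n))


def restrict(f, rho):
    """Return (g, free): g is f restricted by rho, viewed as a function of the free coordinates."""
    free = [i for i, v in enumerate(rho) if v == STAR]

    def g(y):
        x = list(rho)
        for i, b in zip(free, y):
            x[i] = b
        return f(tuple(x))

    return g, free


def parity(x):
    return sum(x) % 2


def completed_input_distribution(n, p):
    """Exact law of X when rho is drawn from R(p) and its stars are filled uniformly."""
    dist = {x: Fraction(0) for x in product((0, 1), repeat=n)}
    for rho in product((0, 1, STAR), repeat=n):
        stars = sum(v == STAR for v in rho)
        pr_rho = p ** stars * ((1 - p) / 2) ** (n - stars)
        for x in dist:
            # x is consistent with rho if it agrees on every fixed coordinate.
            if all(r == STAR or r == b for r, b in zip(rho, x)):
                dist[x] += pr_rho / 2 ** stars
    return dist


rng = random.Random(0)
rho = sample_restriction(6, 0.5, rng)
g, free = restrict(parity, rho)
sign = sum(v for v in rho if v != STAR) % 2  # 1 means the restricted parity is negated
parity_identity_holds = all(
    g(y) == (parity(y) + sign) % 2 for y in product((0, 1), repeat=len(free))
)
uniform_after_completion = completed_input_distribution(3, Fraction(1, 3))
```

Passing $p$ as a `Fraction` keeps the second computation exact, so the check that every input has probability $2^{-n}$ involves no floating-point tolerance.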

Hastad’s blockwise random restrictions. With the above framework in mind, we notice a
conceptual challenge in proving $\mathrm{AC}^0$ depth hierarchy theorems via the random restriction method:
even focusing only on the worst case (i.e. ignoring Property 3), the random restrictions $\mathcal{R}_k$ will
have to satisfy Properties 1 and 2 with the target function $f$ being computable in $\mathrm{AC}^0$. This is
a significantly more delicate task than (say) proving $\mathrm{Parity} \notin \mathrm{AC}^0$ since, roughly speaking, in the
latter case the target function $f = \mathrm{Parity}$ is “much more complex” than the circuit $C \in \mathrm{AC}^0$ to
begin with. In an $\mathrm{AC}^0$ depth hierarchy theorem, both the target $f$ and the approximating circuit $C$
are constant-depth circuits; the target $f$ is “more complex” than $C$ in the sense that it has larger
circuit depth, but this is offset by the fact that the circuit size of $C$ is allowed to be exponentially
larger than that of $f$ (as is the case in both Hastad’s and our theorem). We refer the reader to
Chapter 6.2 of Hastad’s thesis [Has86b] which contains a discussion of this very issue.

Håstad overcomes this difficulty by replacing the “standard” random restrictions $\mathcal{R}(p)$ with
random restrictions specifically suited to Sipser functions being the target: his “blockwise” random
restrictions are designed so that (1) they reduce the depth of the formula computing the Sipser
function by one, but otherwise essentially preserve the rest of its structure, and yet (2) a switching
lemma still holds for any circuit with sufficiently small bottom fan-in. These correspond to
Properties 2 and 1 respectively. However, unlike $\mathcal{R}(p)$, Håstad’s blockwise random restrictions are not
independent across coordinates and do not satisfy Property 3: their composition does not complete
to the uniform distribution $\mathcal{U}$ (and indeed it does not complete to any product distribution). This
is why Håstad’s construction establishes a worst-case rather than average-case depth hierarchy
theorem.

4.2 Our main technique: Random projections 

The crux of the difficulty in proving an average-case $\mathsf{AC}^0$ depth hierarchy theorem therefore lies in
designing random restrictions that satisfy Properties 1, 2, and 3 simultaneously, for a target $f$ in
$\mathsf{AC}^0$ and an arbitrary approximating circuit $C$ of smaller depth but possibly exponentially larger
size. To recall, the “standard” random restrictions $\mathcal{R}(p)$ satisfy Properties 1 and 3 but not 2, and
Håstad’s blockwise variant satisfies Properties 1 and 2 but not 3.

In this paper we overcome this difficulty with projections, a generalization of restrictions. Given
a set of formal variables $\mathcal{X} = \{x_1, \ldots, x_n\}$, a restriction $\rho$ either fixes a variable $x_i$ (i.e. $\rho(x_i) \in \{0,1\}$)
or keeps it alive (i.e. $\rho(x_i) = x_i$, often denoted by $*$). A projection, on the other hand, either fixes
$x_i$ or maps it to a variable $y_j$ from a possibly different space of formal variables $\mathcal{Y} = \{y_1, \ldots, y_{n'}\}$.
Restrictions are therefore a special case of projections where $\mathcal{Y} \equiv \mathcal{X}$, and each $x_i$ can only be fixed
or mapped to itself. (See Definition 4 for precise definitions.) Our arguments crucially employ
projections in which $\mathcal{Y}$ is smaller than $\mathcal{X}$, and where moreover each $x_i$ is only mapped to a specific
element $y_j$ where $j$ depends on $i$ in a carefully designed way that depends on the structure of the
formula computing the Sipser function. Such “collisions”, where blocks of distinct formal variables
in $\mathcal{X}$ are mapped to the same new formal variable $y_j \in \mathcal{Y}$, play a crucial role in our approach. (We
remark that ours is not the first work to consider such a generalization of restrictions. Random
projections are also used in the work of Impagliazzo and Segerlind, which establishes lower bounds
against constant-depth Frege systems with counting axioms in proof complexity [IS01].)

At a high level, our overall approach is structured around a sequence $\mathbf{\Psi}$ of (adaptively chosen)
random projections satisfying Properties 1, 2, and 3 simultaneously, with the target $f$ being Sipser,
a slight variant of the Sipser function which we define in Section 6. We briefly outline how we
establish each of the three properties (it will be more natural for us to prove them in a slightly
different order from the way they are listed in Section 4.1):

- Property 3: $\mathbf{\Psi}$ completes to the uniform distribution. Like Håstad’s blockwise random
restrictions (and unlike the “standard” random restrictions $\mathcal{R}(p)$), the distributions of our
random projections are not independent across coordinates: they are carefully correlated in a way
that depends on the structure of the formula computing Sipser. As discussed above, there is
an inherent tension between the need for such correlations on one hand (to ensure that Sipser
“retains structure”), and the requirement that their composition completes to the uniform
distribution on the other hand (to yield uniform-distribution correlation bounds). We overcome
this difficulty with our notion of projections: in Section 8 we prove that the composition $\mathbf{\Psi}$
of our sequence of random projections completes to the uniform distribution (despite the fact
that every one of the individual random projections comprising $\mathbf{\Psi}$ is highly correlated among
coordinates).

- Property 1: Approximator $C$ simplifies. Next we prove that approximating circuits $C$ of
the types specified in our main lower bounds (Theorems 6 and 7) “collapse to a simple function”
with high probability under our sequence $\mathbf{\Psi}$ of random projections. Following the standard
“bottom-up” approach to proving lower bounds against small-depth circuits, we establish this
by arguing that each of the individual random projections comprising $\mathbf{\Psi}$ “contributes to the
simplification” of $C$ by reducing its depth by (at least) one.

More precisely, in Section 9 we prove a projection switching lemma, showing that a small-width
DNF or CNF “switches” to a small-depth decision tree with high probability under our random
projections. (The depth reduction of $C$ follows by applying this lemma to every one of its
bottom-level depth-2 subcircuits.) Recall that the random projection of a depth-2 circuit over
a set of formal variables $\mathcal{X}$ yields a function over a new set of formal variables $\mathcal{Y}$, and in our
case $\mathcal{Y}$ is significantly smaller than $\mathcal{X}$. In addition to the structural simplification that results
from setting variables to constants (as in Håstad’s Switching Lemma for random restrictions),
the proof of our projection switching lemma also crucially exploits the additional structural
simplification that results from distinct variables in $\mathcal{X}$ being mapped to the same variable in $\mathcal{Y}$.

- Property 2: Target Sipser retains structure. Like Håstad’s blockwise random restrictions,
our random projections are defined with the target function Sipser in mind; in particular, they
are carefully designed so as to ensure that Sipser “retains structure” with high probability under
their composition $\mathbf{\Psi}$.

In Section 10.1 we define the notion of a “typical” outcome of our random projections, and prove
that with high probability all the individual projections comprising $\mathbf{\Psi}$ are typical. (Since our
sequence of random projections is chosen adaptively, this requires a careful definition of typicality
to facilitate an inductive argument showing that our definition “bootstraps” itself.) Next, in
Section 10.2 we show that typical projections have a “very limited and well-controlled” effect
on the structure of Sipser; equivalently, Sipser is resilient against typical projections. Together,
the results of Sections 10.1 and 10.2 show that with high probability, Sipser reduces under $\mathbf{\Psi}$ to
a “well-structured” formula, in sharp contrast with our results from Section 9 showing that the
approximator “collapses to a simple function” with high probability under $\mathbf{\Psi}$.

We remark that the notion of random projections plays a key role in ensuring all three properties 
above. (We give a more detailed overview of our proof in Section 7.3 after setting up the necessary 
terminology and definitions in the next two sections.) 


5 Preliminaries 


5.1 Basic mathematical tools 


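Fact 5.1 (Chernoff bounds). Let $Z_1, \ldots, Z_n$ be independent random variables satisfying $0 \le Z_i \le 1$
for all $i \in [n]$. Let $S = Z_1 + \cdots + Z_n$, and $\mu = \mathbf{E}[S]$. Then for all $\gamma \ge 0$,

\[ \Pr[S \ge (1+\gamma)\mu] \le \exp\!\left(-\frac{\gamma^2}{2+\gamma}\,\mu\right), \]

\[ \Pr[S \le (1-\gamma)\mu] \le \exp\!\left(-\frac{\gamma^2}{2}\,\mu\right). \]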


We will use the following fact implicitly in many of our calculations: 

Fact 5.2. Let $\delta = \delta(n) > 0$ and $n \in \mathbb{N}$, and suppose $\delta n = o_n(1)$. The following inequalities hold
for sufficiently large $n$:

\[ 1 - \delta n \le (1-\delta)^n \le 1 - \tfrac12 \delta n. \]

Finally, the following standard approximations will be useful: 

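Fact 5.3. For $x \ge 2$, we have

\[ \left(1 - \frac{1}{x}\right)^{x} \le e^{-1} \le \left(1 - \frac{1}{x}\right)^{x-1}, \]

or equivalently,

\[ 1 + \frac{1}{x} \le e^{1/x} \le 1 + \frac{1}{x-1}, \]

and for $0 \le x \le 1$, we have $1 + x \le e^x \le 1 + 2x$.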

We write $\log$ to denote logarithm base 2 and $\ln$ to denote natural log.


5.2 Notation 

A DNF is an OR of ANDs (terms) and a CNF is an AND of ORs (clauses). The width of a 
DNF (respectively, CNF) is the maximum number of variables that occur in any one of its terms 
(respectively, clauses). We will assume throughout that our circuits are alternating, meaning that 
every root-to-leaf path alternates between AND gates and OR gates, and layered, meaning that for 
every gate G, every root-to-G path has the same length. By a standard conversion, every depth-d 
circuit is equivalent to a depth-d alternating layered circuit with only a modest increase in size 
(which is negligible given the slack in our analysis). The size of a circuit is its number of gates,
and the depth of a circuit is the length of its longest root-to-leaf path. 

For $p \in [0,1]$ and symbols $\bullet, \circ$, we write “$\{\bullet_p, \circ_{1-p}\}$” to denote the distribution over $\{\bullet, \circ\}$ which
outputs $\bullet$ with probability $p$ and $\circ$ with probability $1-p$. We write “$\{\bullet_p, \circ_{1-p}\}^k$” to denote the
product distribution over $\{\bullet,\circ\}^k$ in which each coordinate is distributed independently according
to $\{\bullet_p, \circ_{1-p}\}$. We write “$\{\bullet_p, \circ_{1-p}\}^k \setminus \{\bullet\}^k$” to denote the product distribution conditioned on not
outputting $\{\bullet\}^k$.

Given $\tau \in \{0,1,*\}^{A \times [\ell]}$ and $a \in A$, we write $\tau_a$ to denote the $\ell$-character string $(\tau_{a,i})_{i \in [\ell]} \in
\{0,1,*\}^{\ell}$, and we sometimes refer to this as the “$a$-th block of $\tau$.”

Throughout the paper we use boldfaced characters such as $\boldsymbol{\rho}$, $\mathbf{X}$, etc. to denote random variables.
We write “$a = b \pm c$” as shorthand to denote that $a \in [b-c, b+c]$, and similarly $a \ne b \pm c$ to denote
that $a \notin [b-c, b+c]$. For a positive integer $k$ we write “$[k]$” to denote the set $\{1, \ldots, k\}$.

The bias of a Boolean function $f$ under an input distribution $\mathbf{Z}$ is defined as

\[ \mathrm{bias}(f, \mathbf{Z}) := \min\bigl\{ \Pr[f(\mathbf{Z}) = 0],\ \Pr[f(\mathbf{Z}) = 1] \bigr\}. \]


5.3 Restrictions and random restrictions 


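Definition 1 (Restriction). A restriction $\rho$ of a finite base set $\{x_\alpha\}_{\alpha \in \Omega}$ of Boolean variables is
a string $\rho \in \{0,1,*\}^{\Omega}$. (We sometimes equivalently view a restriction $\rho$ as a function $\rho : \Omega \to
\{0,1,*\}$.) Given a function $f : \{0,1\}^{\Omega} \to \{0,1\}$ and restriction $\rho \in \{0,1,*\}^{\Omega}$, the $\rho$-restriction of
$f$ is the function $(f \upharpoonright \rho) : \{0,1\}^{\Omega} \to \{0,1\}$ where

\[ (f \upharpoonright \rho)(x) = f(x \upharpoonright \rho), \quad\text{and}\quad (x \upharpoonright \rho)_\alpha := \begin{cases} x_\alpha & \text{if } \rho_\alpha = * \\ \rho_\alpha & \text{otherwise} \end{cases} \quad \text{for all } \alpha \in \Omega. \]

Given a distribution $\mathcal{R}$ over restrictions $\{0,1,*\}^{\Omega}$, the $\mathcal{R}$-random restriction of $f$ is the random
function $f \upharpoonright \boldsymbol{\rho}$ where $\boldsymbol{\rho} \leftarrow \mathcal{R}$.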


Definition 2 (Refinement). Let $\rho, \tau \in \{0,1,*\}^{\Omega}$ be two restrictions. We say that $\tau$ is a refinement
of $\rho$ if $\rho^{-1}(1) \subseteq \tau^{-1}(1)$ and $\rho^{-1}(0) \subseteq \tau^{-1}(0)$, i.e. every variable $x_\alpha$ that is set to 0 or 1 by $\rho$ is set
in the same way by $\tau$ (and $\tau$ may set additional variables to 0 or 1 that $\rho$ does not set).

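Definition 3 (Composition). Let $\rho, \rho' \in \{0,1,*\}^{\Omega}$ be two restrictions. Their composition, denoted
$\rho\rho' \in \{0,1,*\}^{\Omega}$, is the restriction defined by

\[ (\rho\rho')_\alpha := \begin{cases} \rho_\alpha & \text{if } \rho_\alpha \in \{0,1\} \\ \rho'_\alpha & \text{otherwise.} \end{cases} \]

Note that $\rho\rho'$ is a refinement of $\rho$.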
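Definitions 1 through 3 translate directly into a few lines of code. The sketch below is an illustration under our own conventions (restrictions are Python lists over 0, 1 and the string "*", and Boolean functions take a list of bits); none of these names come from the paper.

```python
STAR = "*"

def apply_restriction(f, rho):
    """Return f|rho: live coordinates read the input x, fixed ones read rho."""
    def restricted(x):
        return f([x[i] if r == STAR else r for i, r in enumerate(rho)])
    return restricted

def is_refinement(tau, rho):
    """tau refines rho if tau agrees with rho on every coordinate that rho fixes."""
    return all(r == STAR or t == r for t, r in zip(tau, rho))

def compose(rho, rho_prime):
    """(rho rho')_a is rho_a when rho fixes coordinate a, and rho'_a otherwise."""
    return [r if r != STAR else rp for r, rp in zip(rho, rho_prime)]

# Toy example: a 3-variable AND under two successive restrictions.
and3 = lambda x: int(all(x))
rho = [1, STAR, STAR]
rho_prime = [0, 0, STAR]
composed = compose(rho, rho_prime)   # coordinate 0 keeps the value fixed by rho
```

In this example restricting by `rho` and then by `rho_prime` computes the same function as restricting once by `composed`, which is the sense in which composition models applying restrictions in sequence.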


5.4 Projections and random projections 

A key ingredient in this work is the notion of random projections which generalize random
restrictions. Throughout the paper we will be working with functions over spaces of formal variables
that are partitioned into disjoint blocks of some length $\ell$ (see Section 6 for a precise description
of these spaces). In other words, our functions will be over spaces of formal variables that can be
described as $\mathcal{X} = \{x_{a,i} : a \in A, i \in [\ell]\}$, where we refer to $x_{a,i}$ as the $i$-th variable in the $a$-th block.
We associate with each such space $\mathcal{X}$ a smaller space $\mathcal{Y} = \{y_a : a \in A\}$ containing a new formal
variable for each block of $\mathcal{X}$. Given a function $f$ over $\mathcal{X}$, the projection of $f$ yields a function over
$\mathcal{Y}$, and the random projection of $f$ is the projection of a random restriction of $f$ (which again is a
function over $\mathcal{Y}$). Formally, we have the following definition:

Definition 4 (Projection). The projection operator proj acts on functions $f : \{0,1\}^{A \times [\ell]} \to \{0,1\}$
as follows. The projection of $f$ is the function $(\mathrm{proj}\, f) : \{0,1\}^{A} \to \{0,1\}$ defined by

\[ (\mathrm{proj}\, f)(y) = f(x) \quad\text{where } x_{a,i} = y_a \text{ for all } a \in A \text{ and } i \in [\ell]. \]

Given a restriction $\rho \in \{0,1,*\}^{A \times [\ell]}$, the $\rho$-projection of $f$ is the function $(\mathrm{proj}_\rho\, f) : \{0,1\}^{A} \to
\{0,1\}$ defined by

\[ (\mathrm{proj}_\rho\, f)(y) = f(x) \quad\text{where } x_{a,i} = \begin{cases} y_a & \text{if } \rho_{a,i} = * \\ \rho_{a,i} & \text{otherwise} \end{cases} \quad\text{for all } a \in A \text{ and } i \in [\ell]. \]

Equivalently, $(\mathrm{proj}_\rho\, f) \equiv (\mathrm{proj}\,(f \upharpoonright \rho))$. Given a distribution $\mathcal{R}$ over restrictions in $\{0,1,*\}^{A \times [\ell]}$,
the associated random projection operator is $\mathrm{proj}_{\boldsymbol{\rho}}$ where $\boldsymbol{\rho} \leftarrow \mathcal{R}$, and for $f : \{0,1\}^{A \times [\ell]} \to \{0,1\}$
we call $\mathrm{proj}_{\boldsymbol{\rho}}\, f$ its $\mathcal{R}$-random projection.

Note that when $\ell = 1$, the spaces $\mathcal{X}$ and $\mathcal{Y}$ are identical and our definitions of a $\rho$-projection
and $\mathcal{R}$-random projection coincide exactly with that of a $\rho$-restriction and $\mathcal{R}$-random restriction in
Definition 1 (in this case the projection operator proj is simply the identity operator).

Remark 9. The following interpretation of the projection operator will be useful for us. Let $f$ be
a function over $\mathcal{X}$, and consider its representation as a circuit $C$ (or decision tree) accessing the
formal variables $x_{a,i}$ in $\mathcal{X}$. The projection of $f$ is the function computed by the circuit $C'$, where $C'$
is obtained from $C$ by replacing every occurrence of $x_{a,i}$ in $C$ by $y_a$ for all $a \in A$ and $i \in [\ell]$. Note
that this may result in a significant simplification of the circuit: for example, an AND gate (OR
gate, respectively) in $C$ that accesses both $x_{a,i}$ and $\overline{x_{a,j}}$ for some $a \in A$ and $i, j \in [\ell]$ will access both
$y_a$ and $\overline{y_a}$ in $C'$, and therefore can be simplified and replaced by the constant 0 (1, respectively).
This is a fact we will exploit in the proof of our projection switching lemma in Section 9.1.
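To make Definition 4 and Remark 9 concrete, here is a minimal Python sketch. The encoding is an assumption made for illustration: a restriction is a dictionary from pairs `(a, i)` to 0, 1 or "*", a function over $\mathcal{X}$ takes a dictionary keyed by `(a, i)`, and the projected function takes a dictionary keyed by the block name `a`.

```python
STAR = "*"

def project(f, rho):
    """Return proj_rho f for a restriction rho: dict (a, i) -> 0, 1 or '*'.

    The projected function feeds f the input x with
    x[(a, i)] = y[a] if rho[(a, i)] == '*', and rho[(a, i)] otherwise.
    """
    def projected(y):
        x = {(a, i): (y[a] if r == STAR else r) for (a, i), r in rho.items()}
        return f(x)
    return projected

# Two blocks of length 2; only x_{1,0} is fixed (to 1).
rho = {(0, 0): STAR, (0, 1): STAR, (1, 0): 1, (1, 1): STAR}

# An AND gate reading x_{0,0} and the negation of x_{0,1}: both literals
# collide on y_0 after projection, so the gate becomes the constant 0.
clash = lambda x: x[(0, 0)] & (1 - x[(0, 1)])

# A depth-2 OR of block-wise ANDs collapses to an OR of the block variables.
or_of_ands = lambda x: (x[(0, 0)] & x[(0, 1)]) | (x[(1, 0)] & x[(1, 1)])

clash_projected = project(clash, rho)
or_projected = project(or_of_ands, rho)
```

The first example is the collision effect described in Remark 9; the second shows a two-level formula losing a level because every live variable of a block is identified with the same $y_a$.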

6 The Sipser function and its basic properties 

For $2 \le d \in \mathbb{N}$, in this section we define the depth-$d$ monotone $n$-variable read-once Boolean
formula $\mathsf{Sipser}_d$ and establish some of its basic properties. The $\mathsf{Sipser}_d$ function is very similar to the
depth-$d$ formula considered by Håstad [Has86b]; the only difference is that the fan-ins of the gates
in the top and bottom layers have been slightly adjusted, essentially so as to ensure that the formula
is very close to balanced between the two output values 0 and 1 (note that such balancedness is a
prerequisite for any $(1/2 - o_n(1))$-inapproximability result). The $\mathsf{Sipser}_d$ formula is defined in terms
of an integer parameter $m$; in all our results this is an asymptotic parameter that approaches $+\infty$,
so $m$ should be thought of as “sufficiently large” throughout the paper.

Every leaf of $\mathsf{Sipser}_d$ occurs at the same depth (distance from the root) $d$; there are exactly
$n$ leaves ($n$ will be defined below) and each variable occurs at precisely one leaf. The formula is
alternating, meaning that every root-to-leaf path alternates between AND gates and OR gates; all
of the gates that are adjacent to input variables (i.e. the depth-$(d-1)$ gates) are AND gates, so the
root is an OR gate if $d$ is even and is an AND gate if $d$ is odd. The formula is also depth-regular,
meaning that for each depth (distance from the root) $0 \le k \le d-1$, all of the depth-$k$ gates have
the same fan-in. Hence to completely specify the $\mathsf{Sipser}_d$ formula it remains only to specify the
fan-in sequence $w_0, \ldots, w_{d-1}$, where $w_k$ is the fan-in of every gate at depth $k$. These fan-ins are as
follows:

- The bottommost fan-in is

\[ w_{d-1} := m. \tag{1} \]

We define

\[ p := 2^{-w_{d-1}} = 2^{-m}, \tag{2} \]

and we observe that $p$ is the probability that a depth-$(d-1)$ AND gate is satisfied by a
uniform random choice of $\mathbf{X} \leftarrow \{0_{1/2}, 1_{1/2}\}^n$.

- For each value $1 \le k \le d-2$, the value of $w_k$ is $w_k = w$ where

\[ w := \lfloor m 2^m / \log(e) \rfloor. \tag{3} \]

- The value $w_0$ is defined to be

\[ w_0 := \text{the smallest integer such that } (1 - t_1)^{q w_0} \text{ is at most } \tfrac12, \tag{4} \]

where $t_1$ and $q$ will be defined in Section 7.1, see specifically Equations (8) and (7). Roughly
speaking, $w_0$ is chosen so that the overall formula is essentially balanced under the uniform
distribution (i.e. $\mathsf{Sipser}_d$ satisfies (6) below); see (9) and the discussion thereafter.
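A back-of-the-envelope calculation indicates how large $w_0$ is. It assumes $t_1 = q\,(1 \pm o_m(1))$, which is what Lemma 7.1 in Section 7.1 provides, and uses $q^2 = p = 2^{-m}$; it is a heuristic sketch rather than a replacement for the estimates in Section 7.1.

```latex
\begin{align*}
(1 - t_1)^{q w_0}
  &= \exp\bigl(q w_0 \ln(1 - t_1)\bigr)
   = \exp\bigl(-q w_0 t_1 (1 + O(t_1))\bigr) \\
  &= \exp\bigl(-q^2 w_0 (1 \pm o_m(1))\bigr)
   = \exp\bigl(-2^{-m} w_0 (1 \pm o_m(1))\bigr).
\end{align*}
% The smallest w_0 making this at most 1/2 satisfies
% 2^{-m} w_0 (1 \pm o_m(1)) \approx \ln 2, that is,
\[
  w_0 = 2^{m} \ln(2) \cdot (1 \pm o_m(1)).
\]
```

In particular $w_{d-1} w_0 = m\,2^m \ln(2)\cdot(1 \pm o_m(1))$, which is within a $1 \pm o_m(1)$ factor of the middle fan-in $w$ from (3).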

The number of input variables $n$ for $\mathsf{Sipser}_d$ is $n = \prod_{k=0}^{d-1} w_k = w^{d-2} w_{d-1} w_0$. The estimates for
$t_1$ and $q$ given in (10) imply that $w_0 = 2^m \ln(2) \cdot (1 \pm o_m(1))$, so we have that

\[ n = (1 \pm o_m(1)) \cdot w^{d-1}. \tag{5} \]

We note that for the range of values of $d$ that we consider in this paper, a direct
(but somewhat tedious) analysis implies that the $\mathsf{Sipser}_d$ function is indeed essentially balanced, or
more precisely, that it satisfies

\[ \Pr_{\mathbf{X} \leftarrow \{0_{1/2}, 1_{1/2}\}^n}\bigl[\mathsf{Sipser}_d(\mathbf{X}) = 1\bigr] = \tfrac12 \pm o_n(1). \tag{6} \]

However, since this fact is a direct byproduct of our main theorem (which shows that $\mathsf{Sipser}_d$ cannot
be $(1/2 - o_n(1))$-approximated by any depth-$(d-1)$ formula, let alone by a constant function), we
omit the tedious direct analysis here.

We specify an addressing scheme for the gates and input variables of our $\mathsf{Sipser}_d$ formula which
will be heavily used throughout the paper. Let $A_0 := \{\mathrm{output}\}$, and for $1 \le k \le d$, let $A_k :=
A_{k-1} \times [w_{k-1}]$. An element of $A_k$ specifies the address of a gate at depth (distance from the output
node) $k$ in $\mathsf{Sipser}_d$ in the obvious way; so $A_d = \{\mathrm{output}\} \times [w_0] \times \cdots \times [w_{d-1}]$ is the set of addresses
of the input variables and $|A_d| = n$.
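The addressing scheme and the alternating structure can be exercised on toy fan-ins. The Python sketch below is illustrative: it takes an arbitrary fan-in sequence `fanins = [w_0, ..., w_{d-1}]` rather than the specific values (1), (3) and (4), indexes children from 0 instead of 1, and uses function names of our own choosing.

```python
from itertools import product

def addresses(fanins, k):
    """A_k = {output} x [w_0] x ... x [w_{k-1}], as tuples with 0-indexed children."""
    return [("output",) + t for t in product(*(range(w) for w in fanins[:k]))]

def gate_type(depth, d):
    """Depth-(d-1) gates are ANDs, and gate types alternate towards the root."""
    return "AND" if (d - 1 - depth) % 2 == 0 else "OR"

def evaluate(fanins, x, address=("output",)):
    """Evaluate the gate (or leaf) at `address`; x maps each address in A_d to a bit."""
    d = len(fanins)
    depth = len(address) - 1
    if depth == d:
        return x[address]
    children = (evaluate(fanins, x, address + (i,)) for i in range(fanins[depth]))
    if gate_type(depth, d) == "AND":
        return int(all(children))
    return int(any(children))
```

For instance, with `fanins = [2, 3]` the formula is an OR of two 3-way ANDs over the six leaf addresses returned by `addresses(fanins, 2)`.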

We close this section by introducing notation for the following family of formulas related to
$\mathsf{Sipser}_d$:

Definition 5. For $1 \le k \le d$, we write $\mathsf{Sipser}_d^{(k)} : \{0,1\}^{A_k} \to \{0,1\}$ to denote the depth-$k$ formula
obtained from $\mathsf{Sipser}_d$ by discarding all gates at depths $k+1$ through $d-1$, and replacing every
depth-$k$ gate at address $a \in A_k$ with a fresh formal variable $y_a$.

Note that $\mathsf{Sipser}_d^{(1)}$ is the top gate of $\mathsf{Sipser}_d$; in particular, $\mathsf{Sipser}_d^{(1)}$ is a $w_0$-way OR if $d$ is even,
and a $w_0$-way AND if $d$ is odd. Note also that $\mathsf{Sipser}_d^{(d)}$ is simply $\mathsf{Sipser}_d$ itself, although we stress
that $\mathsf{Sipser}_d^{(k)}$ is not the same as $\mathsf{Sipser}_k$ for $1 \le k \le d-1$.

7 Setup for and overview of our proof

7.1 Key parameter settings 

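The starting point for our parameter settings is the pair of fixed values

\[ \lambda := \frac{(\log w)^{3/2}}{w^{5/4}} \quad\text{and}\quad q := \sqrt{p} = 2^{-m/2}. \tag{7} \]

Given these fixed values of $\lambda$ and $q$, we define a sequence of parameters $t_{d-1}, \ldots, t_1$ as

\[ t_{d-1} := \frac{p - \lambda}{q}, \qquad t_{k-1} := \frac{(1 - t_k)^{qw} - \lambda}{q} \quad\text{for } k = d-1, \ldots, 2. \tag{8} \]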

Each of our $d-1$ random projections will be defined with respect to an underlying product
distribution. Our first random projection $\mathrm{proj}_{\boldsymbol{\rho}^{(d)}}$ will be associated with the uniform
distribution over $\{0,1\}^n$; this is because our ultimate goal is to establish uniform-distribution correlation
bounds. For $k \in \{2, \ldots, d-1\}$ the subsequent random projections $\mathrm{proj}_{\boldsymbol{\rho}^{(k)}}$ will be associated with
either the $t_k$-biased or $(1 - t_k)$-biased product distribution (depending on whether $d-k$ is even
or odd). Recalling our discussion in Section 4 of the framework for proving correlation bounds
(in particular, the three key properties our random projections have to satisfy), the values for
$t_1, \ldots, t_{d-1}$ are chosen carefully so that the compositions of our $d-1$ random projections complete
to the uniform distribution, satisfying Property 3 (we prove this in Section 8).

The next lemma gives bounds on $t_{d-1}, \ldots, t_1$ which show that these values “stay under control”.
By our definitions of $\lambda$, $p$ and $q$ in (7), we have that $t_{d-1} = q - o(q)$, and we will need the fact
that the values of $t_k$ for $k = d-1, \ldots, 2$ remain in the range $q \pm o(q)$. Roughly speaking, since
each $t_{k-1}$ is defined inductively in terms of $t_k$ from $k = d-1$ down to 1, we have to argue that
these values do not “drift” significantly from the initial value of $t_{d-1} = q - o(q)$. We need to keep
these values under control for two reasons: first, the magnitude of these values directly affects the
strength of our Projection Switching Lemma (as we will see in Section 9.1, our error bounds
depend on the magnitude of these $t_k$’s). Second, since the top fan-in $w_0$ of our $\mathsf{Sipser}_d$ function
is directly determined by $t_1$ (recall (4)), we need a bound on $t_1$ to control the structure of this
function.

Lemma 7.1. There is a universal constant c > 0 such that for 2 < d < , , we have that 

\[ t_k = q \pm q^{1.1} \quad\text{for all } k \in [d-1]. \]

We defer the proof of Lemma 7.1 to Appendix A. The $k = 1$ case of Lemma 7.1 along with our
definition of $w_0$ (recall (4)) give us the bounds

\[ \tfrac12 \ge (1 - t_1)^{q w_0} > \tfrac12 (1 - t_1)^{q} = \tfrac12 \bigl(1 - \Theta(2^{-m})\bigr). \tag{9} \]

These bounds (showing that $(1 - t_1)^{q w_0}$ is very close to $1/2$) will be useful for our proof in
Section 10.2 that $\mathsf{Sipser}_d$ remains essentially unbiased (i.e. it remains “structured”) under our random
projections, which in turn implies our claim (6) that $\mathsf{Sipser}_d$ is essentially balanced (see Remark 17).

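We close this subsection with the following estimates of our key parameters in terms of $w$ for
later reference:

\[ p = \Theta\!\left(\frac{\log w}{w}\right), \qquad q = \Theta\!\left(\sqrt{\frac{\log w}{w}}\right), \qquad t_k = \Theta\!\left(\sqrt{\frac{\log w}{w}}\right) \ \text{for all } k \in [d-1]. \tag{10} \]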
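The recursion defining the $t_k$ can be explored numerically. The Python sketch below is a floating-point illustration, not part of the paper's argument: the function name is ours, $\lambda$ may be supplied by the caller, and the default value $(\log w)^{3/2}/w^{5/4}$ is an assumed setting that should be checked against (7). A first-order calculation suggests that each step of the recursion can amplify the relative deviation of $t_k$ from $q$ by a factor of roughly $m \ln 2$, so the $t_k$ stay close to $q$ only when $m$ is large relative to $d$.

```python
import math

def key_parameters(m, d, lam=None):
    """Return (p, q, w, lam, t) with t[k] = t_k for k = d-1, ..., 1."""
    p = 2.0 ** (-m)
    q = math.sqrt(p)
    w = math.floor(m * 2 ** m / math.log2(math.e))
    if lam is None:
        # Assumed default for lambda; pass lam explicitly to override it.
        lam = math.log2(w) ** 1.5 / w ** 1.25
    t = {d - 1: (p - lam) / q}
    for k in range(d - 1, 1, -1):
        # (1 - t_k)^(q w), computed stably as exp(q w log(1 - t_k)).
        power = math.exp(q * w * math.log1p(-t[k]))
        t[k - 1] = (power - lam) / q
    return p, q, w, lam, t

p, q, w, lam, t = key_parameters(m=200, d=5)
ratios = {k: t[k] / q for k in sorted(t)}   # each ratio should be close to 1
```

With much smaller values of `m` the same code shows the “drift” that Lemma 7.1 has to rule out: the ratios move away from 1 after a few steps.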


7.2 The initial and subsequent random projections 


As described in Section 4, our overall approach is structured around a sequence of random
projections which we will apply to both the target function $\mathsf{Sipser}_d$ and the approximating circuit $C$.
Both are functions over $\{0,1\}^n \equiv \{0,1\}^{A_d}$, and our $d-1$ random projections will sequentially
transform them from being over $\{0,1\}^{A_k}$ to being over $\{0,1\}^{A_{k-1}}$ for $k = d$ down to $k = 2$. Thus,
at the end of the overall process both the randomly projected target and the randomly projected
approximator are functions over $\{0,1\}^{A_1} \equiv \{0,1\}^{w_0}$.

We now formally define this sequence of random projections; recalling Definition 4, to define a
random projection operator it suffices to specify a distribution over random restrictions, and this
is what we will do. We begin with the initial random projection:

Definition 6 (Initial random projection). The distribution $\mathcal{R}_{\mathrm{init}}$ over restrictions $\rho$ in
$\{0,1,*\}^{A_{d-1} \times [m]} \equiv \{0,1,*\}^n$ (recall that $w_{d-1} = m$) is defined as follows: independently for each
$a \in A_{d-1}$,

\[ \boldsymbol{\rho}_a \leftarrow \begin{cases} \{1\}^m & \text{with probability } \lambda \\ \{*_{1/2}, 1_{1/2}\}^m \setminus \{1\}^m & \text{with probability } q \\ \{0_{1/2}, 1_{1/2}\}^m \setminus \{1\}^m & \text{with probability } 1 - \lambda - q. \end{cases} \tag{11} \]
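As an illustration of Definition 6, the following Python sketch samples a restriction block by block. The function names and the dictionary representation are our own choices, and the conditioning on “not all ones” is implemented by rejection sampling, which is adequate for small toy values of $m$.

```python
import random

STAR = "*"

def sample_conditioned(symbols, m, forbidden, rng):
    """Uniform string in symbols^m, conditioned on not being `forbidden`."""
    while True:
        s = tuple(rng.choice(symbols) for _ in range(m))
        if s != forbidden:
            return s

def sample_init_block(m, lam, q, rng):
    """One block of R_init: the three cases of equation (11)."""
    all_ones = (1,) * m
    u = rng.random()
    if u < lam:
        return all_ones
    if u < lam + q:
        return sample_conditioned((STAR, 1), m, all_ones, rng)
    return sample_conditioned((0, 1), m, all_ones, rng)

def sample_init(blocks, m, lam, q, rng=None):
    """Draw rho <- R_init as a dict mapping each block address to its m symbols."""
    rng = rng or random.Random()
    return {a: sample_init_block(m, lam, q, rng) for a in blocks}
```

Every sampled block is either all ones, or a string over $\{*, 1\}$ containing at least one $*$, or a string over $\{0, 1\}$ containing at least one 0; a block never mixes 0 and $*$.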

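Remark 10. The description of $\mathcal{R}_{\mathrm{init}}$ given in Definition 6 will be most convenient for our
arguments, but we note here the following equivalent view of an $\mathcal{R}_{\mathrm{init}}$-random projection. Let $\mathcal{R}'_{\mathrm{init}}$ be
the distribution over restrictions $\rho'$ in $\{0,1,*\}^{A_{d-1} \times [m]} \equiv \{0,1,*\}^n$ where

\[ \boldsymbol{\rho}'_a \leftarrow \{*_{1/2}, 1_{1/2}\}^m \setminus \{1\}^m \quad\text{independently for each } a \in A_{d-1}, \]

and $\mathcal{R}''_{\mathrm{init}}$ be the distribution of restrictions $\rho''$ in $\{0,1,*\}^{A_{d-1}}$ where

\[ \boldsymbol{\rho}''_a \leftarrow \begin{cases} 1 & \text{with probability } \lambda \\ * & \text{with probability } q \\ 0 & \text{with probability } 1 - \lambda - q \end{cases} \quad\text{independently for each } a \in A_{d-1}. \]

Then for all $f : \{0,1\}^n \to \{0,1\}$ we have that $\mathrm{proj}_{\boldsymbol{\rho}}\, f$, where $\boldsymbol{\rho} \leftarrow \mathcal{R}_{\mathrm{init}}$, is distributed identically
to

\[ (\mathrm{proj}_{\boldsymbol{\rho}'}\, f) \upharpoonright \boldsymbol{\rho}'' \quad\text{where } \boldsymbol{\rho}' \leftarrow \mathcal{R}'_{\mathrm{init}} \text{ and } \boldsymbol{\rho}'' \leftarrow \mathcal{R}''_{\mathrm{init}}. \]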

7.2.1 Subsequent random projections 

Our subsequent random projections will alternate between two types, depending on whether $d-k$ is
even or odd. These types are dual to each other in the sense that their distributions are completely
identical, except with the roles of 1 and 0 swapped; in other words, the bitwise complement of a draw
from the first type yields a draw from the second type. To avoid redundancy in our definitions we
introduce the notation in Table 2: we represent $\{0,1\}^{A_k}$ as $\{\bullet, \circ\}^{A_k}$, where a $\circ$-value corresponds to
either 1 or 0 depending on whether $d-k$ is even or odd, and the $\bullet$-value is simply the complement of
the $\circ$-value. For example, the string $(\circ, \circ, \bullet, \circ)$ translates to $(1,1,0,1)$ if $d-k$ is even, and $(0,0,1,0)$
if $d-k$ is odd.

In an interesting contrast with Håstad’s proofs of the worst-case depth hierarchy theorem
(Theorem 4) and of $\mathrm{Parity} \notin \mathsf{AC}^0$, our stage-wise random projection process is adaptive: apart from
the initial $\mathcal{R}_{\mathrm{init}}$-random projection, the distribution of each random projection depends on the
outcome of the previous. We will need the following notion of the “lift” of a restriction to describe
this dependence:

                     Gates of Sipser_d at depth k-1     ∘     •
  d - k ≡ 0 mod 2    AND                                1     0
  d - k ≡ 1 mod 2    OR                                 0     1

Table 2: Conversion table for $\tau \in \{\bullet, \circ, *\}^{A_k}$ where $1 \le k \le d$.


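Definition 7 (Lift). Let $2 \le k \le d$ and $\tau \in \{\bullet, \circ, *\}^{A_{k-1} \times [w_{k-1}]} \equiv \{\bullet, \circ, *\}^{A_k}$. The lift of $\tau$ is the
string $\hat{\tau} \in \{\bullet, \circ, *\}^{A_{k-1}}$ defined as follows: for each $a \in A_{k-1}$, the coordinate $\hat{\tau}_a$ of $\hat{\tau}$ is

\[ \hat{\tau}_a = \begin{cases} \circ & \text{if } \tau_{a,i} = \bullet \text{ for any } i \in [w_{k-1}] \\ \bullet & \text{if } \tau_a = \{\circ\}^{w_{k-1}} \\ * & \text{if } \tau_a \in \{*, \circ\}^{w_{k-1}} \setminus \{\circ\}^{w_{k-1}}. \end{cases} \]

We remind the reader that $\tau \in \{\bullet, \circ, *\}^{A_k}$ and $\hat{\tau} \in \{\bullet, \circ, *\}^{A_{k-1}}$ belong to adjacent levels (i.e. they
fall under different rows in Table 2). Consequently, for example, if 1 corresponds to $\bullet$ as a symbol
in $\tau$ then it corresponds to $\circ$ as a symbol in $\hat{\tau}$, and vice versa.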

Later this notion of the “lift” of a restriction will also be handy when we describe the effect of
our random projections on the target function $\mathsf{Sipser}_d$. The high-level rationale behind it is that
$\hat{\tau} \in \{\bullet, \circ, *\}^{A_{k-1}}$ denotes the values that the bottom-layer gates of $\mathsf{Sipser}_d^{(k)}$ take on when its input
variables are set according to $\tau \in \{\bullet, \circ, *\}^{A_k}$. As a concrete example, suppose $d - k \equiv 0 \bmod 2$
and let $\tau \in \{0,1,*\}^{A_k}$ be a restriction. Since $d - k \equiv 0 \bmod 2$, recalling Table 2 we have that the
bottom-layer gates of $\mathsf{Sipser}_d^{(k)}$ (or equivalently, the gates of $\mathsf{Sipser}_d$ at depth $k-1$) are AND gates.
For every block $a \in A_{k-1}$,

- If $\tau_{a,i} = 0$ for some $i \in [w_{k-1}]$, the AND gate at address $a$ is falsified and has value 0.

- If $\tau_a = \{1\}^{w_{k-1}}$, the AND gate at address $a$ is satisfied and has value 1.

- If $\tau_a \in \{*, 1\}^{w_{k-1}} \setminus \{1\}^{w_{k-1}}$, the value of the AND gate at address $a$ remains undetermined (which
we denote as having value $*$).

These three cases correspond exactly to the three branches in Definition 7, and so indeed $\hat{\tau}_a \in
\{0,1,*\}$ represents the value that the AND gate at address $a$ takes when its input variables are set
according to $\tau_a \in \{0,1,*\}^{w_{k-1}}$.
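The lift and the symbol conversion of Table 2 are easy to mechanize. The sketch below is illustrative and uses our own representation (a restriction is a dictionary from block addresses to tuples of the symbols "•", "∘" and "*"); `decode` implements the conversion of Table 2 for a given parity of $d-k$.

```python
BULLET, CIRC, STAR = "•", "∘", "*"

def lift_block(block):
    """The three branches of Definition 7 for a single block tau_a."""
    if any(s == BULLET for s in block):
        return CIRC
    if all(s == CIRC for s in block):
        return BULLET
    return STAR

def lift(tau):
    """Lift of tau: one symbol per block address."""
    return {a: lift_block(block) for a, block in tau.items()}

def decode(symbol, d_minus_k):
    """Table 2: ∘ means 1 and • means 0 when d - k is even; swapped when odd."""
    if symbol == STAR:
        return STAR
    circ_value = 1 if d_minus_k % 2 == 0 else 0
    return circ_value if symbol == CIRC else 1 - circ_value
```

For example, when $d-k$ is even a block consisting only of $\circ$ symbols encodes all ones; its lift is $\bullet$, which decodes to 1 one level up (where $d-(k-1)$ is odd), matching the value of the satisfied AND gate.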

We shall require the following technical definition: 

Definition 8 ($k$-acceptable). For $2 \le k \le d-1$ and a set $S \subseteq [w_{k-1}]$, we say that $S$ is $k$-acceptable
if

\[ |S| = qw \pm w^{\beta(k,d)} \quad\text{where } \beta(k,d) := \frac13 + \frac{d-k-1}{12d}. \]

Note that $\frac13 \le \beta(k,d) < \frac{5}{12} < \frac12$ for all $d \in \mathbb{N}$ and $2 \le k \le d-1$.

For intuition, in the above definition $S$ should be thought of as specifying those children of a
particular depth-$(k-1)$ gate of $\mathsf{Sipser}_d$ that take the value $*$ under certain restrictions (defined
below). We want the size of this set to be essentially $qw$, and as $k$ gets smaller (closer to the root),
for technical reasons we allow more and more (but never too much) deviation from this desired
value. See Section 10.1 for a detailed discussion.

We are now ready to give the key definition for our subsequent random projections: 

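Definition 9 (Subsequent random projections). Let $\tau \in \{\bullet, \circ, *\}^{A_k}$ where $2 \le k \le d-1$. We
define a distribution $\mathcal{R}(\tau)$ over refinements $\rho \in \{\bullet, \circ, *\}^{A_k}$ of $\tau$ as follows. Independently for each
$a \in A_{k-1}$, writing $S_a = S_a(\tau)$ to denote $\tau_a^{-1}(*) = \{i \in [w_{k-1}] : \tau_{a,i} = *\}$ and $\boldsymbol{\rho}(S_a)$ to denote the
substring of $\boldsymbol{\rho}_a$ with coordinates in $S_a$,

- If $\hat{\tau}_a = \circ$ (i.e. if $\tau_{a,i} = \bullet$ for some $i \in [w_{k-1}]$) or if $S_a$ is not $k$-acceptable, then

\[ \boldsymbol{\rho}(S_a) \leftarrow \{\bullet_{t_k}, \circ_{1-t_k}\}^{S_a}. \]

- If $\hat{\tau}_a = *$ (i.e. if $\tau_a \in \{*, \circ\}^{w_{k-1}} \setminus \{\circ\}^{w_{k-1}}$) and $S_a$ is $k$-acceptable, then

\[ \boldsymbol{\rho}(S_a) \leftarrow \begin{cases} \{\circ\}^{S_a} & \text{with probability } \lambda \\ \{*_{t_k}, \circ_{1-t_k}\}^{S_a} \setminus \{\circ\}^{S_a} & \text{with probability } q_a \\ \{\bullet_{t_k}, \circ_{1-t_k}\}^{S_a} \setminus \{\circ\}^{S_a} & \text{with probability } 1 - \lambda - q_a, \end{cases} \tag{12} \]

where

\[ q_a := \frac{(1 - t_k)^{|S_a|} - \lambda}{t_{k-1}} \quad\text{is chosen to satisfy}\quad (1 - t_k)^{|S_a|} = \lambda + q_a t_{k-1}. \tag{13} \]

(Note that if $\hat{\tau}_a = \bullet$ then $\tau_{a,i} = \circ$ for all $i \in [w_{k-1}]$, and so $\tau_a$ cannot be refined further.)

For all $a \in A_{k-1}$ and $i \in [w_{k-1}]$ such that $\tau_{a,i} \in \{\bullet, \circ\}$, we set $\rho_{a,i} = \tau_{a,i}$ and so $\rho$ is indeed a
refinement of $\tau$.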
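The per-block sampling rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's construction verbatim: the function names are ours, the $k$-acceptability test is passed in as a caller-supplied predicate `acceptable` on $|S_a|$, the first branch fills the live positions with independent $\{\bullet_{t_k}, \circ_{1-t_k}\}$ symbols as we read the definition, and the conditioning on “not all $\circ$” uses rejection sampling, which is only practical for toy parameters.

```python
import random

BULLET, CIRC, STAR = "•", "∘", "*"

def biased_string(sym, n, t, rng):
    """n independent symbols: `sym` with probability t, ∘ with probability 1 - t."""
    return [sym if rng.random() < t else CIRC for _ in range(n)]

def biased_not_all_circ(sym, n, t, rng):
    """Same product distribution, conditioned on not being all ∘ (needs n >= 1)."""
    while True:
        s = biased_string(sym, n, t, rng)
        if any(c != CIRC for c in s):
            return s

def refine_block(tau_a, t_k, t_km1, lam, acceptable, rng):
    """Sample the block rho_a of a refinement of tau, given the block tau_a."""
    S = [i for i, c in enumerate(tau_a) if c == STAR]   # live positions S_a
    if BULLET in tau_a or not acceptable(len(S)):
        fill = biased_string(BULLET, len(S), t_k, rng)
    elif S:
        q_a = ((1 - t_k) ** len(S) - lam) / t_km1        # equation (13)
        u = rng.random()
        if u < lam:
            fill = [CIRC] * len(S)
        elif u < lam + q_a:
            fill = biased_not_all_circ(STAR, len(S), t_k, rng)
        else:
            fill = biased_not_all_circ(BULLET, len(S), t_k, rng)
    else:
        fill = []                                        # tau_a is all ∘: nothing to refine
    rho_a = list(tau_a)
    for i, c in zip(S, fill):
        rho_a[i] = c
    return rho_a
```

Positions already fixed by `tau_a` are never changed, so the output is a refinement of the input block; in the second branch a block ends up either all $\circ$, or still containing a $*$ but no new $\bullet$, or fully fixed with at least one $\bullet$.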

Remark 11. We remark that $q_a$ as defined in (13) is indeed a well-defined quantity in $[0,1]$ if $S_a$ is
$k$-acceptable. We omit the straightforward verification here since our analysis in Section 10.1 will
in fact establish a stronger statement showing that $q_a = q \pm o(q)$; see Lemma 10.5.

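Remark 12. By inspecting Definition 6, we see that for all $\rho \in \mathrm{supp}(\mathcal{R}_{\mathrm{init}})$ and blocks $a \in A_{d-1}$,

\[ \rho_{a,i} = * \text{ for some } i \in [m] \quad\text{iff}\quad \rho_a \in \{*, 1\}^m \setminus \{1\}^m, \text{ or equivalently,} \]

\[ \rho_{a,i} = * \text{ for some } i \in [m] \quad\text{iff}\quad \hat{\rho}_a = *, \]

and hence for all $h : \{0,1\}^n \to \{0,1\}$ the projection $\mathrm{proj}_\rho\, h : \{0,1\}^{A_{d-1}} \to \{0,1\}$ depends only
on the coordinates in $(\hat{\rho})^{-1}(*) \subseteq A_{d-1}$. Likewise, by inspecting Definition 9 we have that for all
$\tau \in \{\bullet, \circ, *\}^{A_k}$, $\rho \in \mathrm{supp}(\mathcal{R}(\tau))$, and blocks $a \in A_{k-1}$,

\[ \rho_{a,i} = * \text{ for some } i \in [w_{k-1}] \quad\text{iff}\quad \rho_a \in \{*, \circ\}^{w_{k-1}} \setminus \{\circ\}^{w_{k-1}}, \text{ or equivalently,} \]

\[ \rho_{a,i} = * \text{ for some } i \in [w_{k-1}] \quad\text{iff}\quad \hat{\rho}_a = *, \]

and hence for all $h : \{0,1\}^{A_k} \to \{0,1\}$ the projection $\mathrm{proj}_\rho\, h : \{0,1\}^{A_{k-1}} \to \{0,1\}$ depends only on
the coordinates in $(\hat{\rho})^{-1}(*) \subseteq A_{k-1}$. Our proof that our sequence of random projections (based on
Definitions 6 and 9 as described in Definition 4) completes to the uniform distribution will rely on
these properties; see Section 8.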


7.3 Overview of our proof 

With the definitions from Section 7.2 in hand, we are (finally) in a position to give a detailed
overview of our proof. Let $C$ be a depth-$d$ approximating circuit for $\mathsf{Sipser}_d$, where $C$ either
has significantly smaller bottom fan-in than $\mathsf{Sipser}_d$ (in the case of Theorem 6) or the opposite
alternation pattern to $\mathsf{Sipser}_d$ (in the case of Theorem 7), and $C$ satisfies the size bounds given in
the respective theorem statements. In both cases our goal is to show that $C$ has small correlation
with $\mathsf{Sipser}_d$, i.e. to prove that

\[ \Pr[\mathsf{Sipser}_d(\mathbf{X}) \ne C(\mathbf{X})] \ge \tfrac12 - o_n(1) \tag{14} \]

for a uniform random input $\mathbf{X} \leftarrow \{0_{1/2}, 1_{1/2}\}^n$. At a high level, we do this by analyzing the effect
of $d-1$ random projections on the target and the approximator: we begin with an $\mathcal{R}_{\mathrm{init}}$-random
projection $\mathrm{proj}_{\boldsymbol{\rho}^{(d)}}$ where $\boldsymbol{\rho}^{(d)} \leftarrow \mathcal{R}_{\mathrm{init}}$, followed by $\mathrm{proj}_{\boldsymbol{\rho}^{(d-1)}}$ where $\boldsymbol{\rho}^{(d-1)} \leftarrow \mathcal{R}(\widehat{\boldsymbol{\rho}^{(d)}})$, and then
$\mathrm{proj}_{\boldsymbol{\rho}^{(d-2)}}$ where $\boldsymbol{\rho}^{(d-2)} \leftarrow \mathcal{R}(\widehat{\boldsymbol{\rho}^{(d-1)}})$, and so on. It is interesting to note that unlike Håstad’s proofs
of the worst-case depth hierarchy theorem (Theorem 4) and of $\mathrm{Parity} \notin \mathsf{AC}^0$, the distribution of
our $k$-th random projection is defined adaptively depending on the outcome of the $(k-1)$-st. For
notational concision we introduce the following definition for this overall $(d-1)$-stage projection:

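Definition 10. Given a function $f : \{0,1\}^n \to \{0,1\}$, we write $\mathbf{\Psi}(f) : \{0,1\}^{w_0} \to \{0,1\}$ to denote
the following random projection of $f$:

\[ \mathbf{\Psi}(f) \equiv \mathrm{proj}_{\boldsymbol{\rho}^{(2)}}\, \mathrm{proj}_{\boldsymbol{\rho}^{(3)}} \cdots \mathrm{proj}_{\boldsymbol{\rho}^{(d-1)}}\, \mathrm{proj}_{\boldsymbol{\rho}^{(d)}}\, f, \]

where $\boldsymbol{\rho}^{(d)} \leftarrow \mathcal{R}_{\mathrm{init}}$ and $\boldsymbol{\rho}^{(k)} \leftarrow \mathcal{R}(\widehat{\boldsymbol{\rho}^{(k+1)}})$ for all $2 \le k \le d-1$. We will sometimes refer to the
overall process as a $\mathbf{\Psi}$-random projection, and $\mathbf{\Psi}(f)$ as the $\mathbf{\Psi}$-random projection of $f$. (We remind
the reader that the projection of a function over $\{0,1\}^{A_k}$ yields a function over $\{0,1\}^{A_{k-1}}$ for all
$2 \le k \le d$, and in particular $\mathbf{\Psi}(f)$ is indeed a function over $\{0,1\}^{A_1} \equiv \{0,1\}^{w_0}$.)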

Recalling the framework for proving correlation bounds discussed in Section 4, the rest of the
paper is structured around showing that a $\mathbf{\Psi}$-random projection satisfies the three key properties
outlined in Section 4:

Property 1. The approximating circuit $C$ simplifies under a $\mathbf{\Psi}$-random projection.

Property 2. The target $\mathsf{Sipser}_d$ remains structured under a $\mathbf{\Psi}$-random projection.

Property 3. $\mathbf{\Psi}$ completes to the uniform distribution.

Section 8. We begin in Section 8 with Property 3. We show that

\[ \Pr[\mathsf{Sipser}_d(\mathbf{X}) \ne C(\mathbf{X})] = \Pr[(\mathbf{\Psi}(\mathsf{Sipser}_d))(\mathbf{Y}) \ne (\mathbf{\Psi}(C))(\mathbf{Y})] \tag{15} \]

where $\mathbf{Y}$ is drawn from an appropriate product distribution $\mathcal{D}$ over $\{0,1\}^{w_0}$ ($\mathcal{D}$ is the $t_1$-biased
product distribution if $d$ is even, and the $(1 - t_1)$-biased product distribution if $d$ is odd). This reduces
our goal of bounding the correlation between $\mathsf{Sipser}_d$ and $C$ (i.e. (14)) under the uniform
distribution, to the task of bounding the correlation between their $\mathbf{\Psi}$-random projections $\mathbf{\Psi}(\mathsf{Sipser}_d)$ and
$\mathbf{\Psi}(C)$ with respect to $\mathcal{D}$.

Section 9. With the reduction (15) in hand, we turn our attention to Property 1, showing that
the approximating circuit $C$ of the type specified in either Theorem 6 or Theorem 7 “collapses to a simple
function” under a $\mathbf{\Psi}$-random projection. More precisely, for the case that the depth-$d$ circuit $C$ has
significantly smaller bottom fan-in than $\mathsf{Sipser}_d$ we show that $C$ collapses to a shallow decision tree,
and for the case that $C$ has the opposite alternation pattern to $\mathsf{Sipser}_d$ we show that $C$ collapses to
a small-width depth-two circuit with top gate opposite to that of $\mathbf{\Psi}(\mathsf{Sipser}_d)$. (In both cases these
statements are with high probability under a $\mathbf{\Psi}$-random projection.)

In close parallel with Håstad’s “bottom-up” proof of $\mathsf{Parity} \notin \mathsf{AC}^0$, the main technical ingredient
in this section is a projection switching lemma showing that the random projection $\mathrm{proj}_{\rho^{(k)}}$ of a
small-width DNF or CNF “switches” to a small-depth decision tree with high probability. Applying
this lemma to every bottom-level depth-2 subcircuit of $C$, we are able to argue that each of the
$d-1$ random projections comprising $\Psi$ reduces the depth of $C$ by one with high probability, and
thus $\Psi(C)$ collapses to a small-depth decision tree or small-width depth-two circuit as claimed.

Section 10. It remains to argue that the target $\mathsf{Sipser}_d$, in contrast with the approximating
circuit $C$, “retains structure” with high probability under a $\Psi$-random projection. This is a
high-probability statement because there is a nonzero failure probability introduced by each of the
$d-1$ individual random projections $\mathrm{proj}_{\rho^{(k)}}$ that comprise $\Psi$ (see Footnote 3 for
an example of a possible “failure event” for one of these restrictions). To reason about and bound
these failure probabilities, in Section 10.1 we introduce the notion of a “typical” restriction. The
parameters of our definition of typicality are chosen carefully to ensure that

(i) $\rho^{(d)} \leftarrow \mathcal{R}_{\mathrm{init}}$ is typical with high probability, and

(ii) if $\rho^{(k+1)}$ is typical, then $\rho^{(k)} \leftarrow \mathcal{R}(\widehat{\rho^{(k+1)}})$ is also typical with high probability.

We establish (i) and (ii) in Section 10.1. Together, (i) and (ii) imply that with high probability
$\Psi = \{\rho^{(k)}\}_{k \in \{2,\ldots,d\}}$ is such that $\rho^{(2)}, \ldots, \rho^{(d)}$ are all typical; we use this in Section 10.2.

With the notion of typical restrictions in hand, in Section 10.2 we establish Property 2, showing
that $\mathsf{Sipser}_d$ “survives” a $\Psi$-random projection (i.e. it “retains structure”) with high probability.
More formally, for outcomes $\Psi = \{\rho^{(k)}\}_{k \in \{2,\ldots,d\}}$ of the $\Psi$-random projection such that $\rho^{(2)}, \ldots, \rho^{(d)}$ are all typical, we
prove that the $\Psi$-projected target $\Psi(\mathsf{Sipser}_d)$ is “well-structured” in the following sense:

(i) $\Psi(\mathsf{Sipser}_d)$ is a depth-one formula: an OR if $d$ is even, an AND if $d$ is odd.

(ii) The bias of $\Psi(\mathsf{Sipser}_d)$ under $\mathcal{D}$ is close to 1/2; that is,

$$\mathrm{bias}(\Psi(\mathsf{Sipser}_d), Y) = \tfrac{1}{2} - o_n(1).$$

Recall that we have shown in Section 10.1 that with high probability $\Psi = \{\rho^{(k)}\}_{k \in \{2,\ldots,d\}}$ is such
that $\rho^{(2)}, \ldots, \rho^{(d)}$ are all typical. Therefore, the results of these two subsections together imply
that the randomly projected target $\Psi(\mathsf{Sipser}_d)$ satisfies both (i) and (ii) with high probability.

Section 11. Having established Properties 1, 2, and 3, it remains to bound the correlation
between a depth-one formula with bias essentially 1/2 and a small-width CNF formula of opposite
alternation with respect to the product distribution $\mathcal{D}$ over $\{0,1\}^{w_0}$. (Recall that our results from
Section 10.2 show that $\Psi(\mathsf{Sipser}_d)$ collapses to the former with high probability, and our results
from Section 9 show that $\Psi(C)$ collapses to the latter with high probability; this holds in both
cases since a shallow decision tree is a small-width CNF.) We prove this correlation bound using
a slight extension of an argument in [OW07], and with this final piece in hand our main theorems
follow from straightforward arguments putting the pieces together.

8 Composition of projections complete to uniform 

Our goal in this section is to establish the following proposition:

Proposition 8.1. Consider $f, g : \{0,1\}^n \to \{0,1\}$. Let $X \leftarrow \{0_{1/2}, 1_{1/2}\}^n$, $Y \leftarrow \{0_{1-t_1}, 1_{t_1}\}^{w_0}$
if $d$ is even, and $Y \leftarrow \{0_{t_1}, 1_{1-t_1}\}^{w_0}$ if $d$ is odd. Then

$$\Pr[f(X) \ne g(X)] = \Pr[(\Psi(f))(Y) \ne (\Psi(g))(Y)].$$

As discussed in Section 7.3 we will ultimately apply Proposition 8.1 with $f$ being our target
function $\mathsf{Sipser}_d$ and $g$ being the approximating circuit $C$. This allows us to translate the
inapproximability of $\Psi(\mathsf{Sipser}_d)$ by $\Psi(C)$ (either with respect to the $t_1$-biased or the $(1-t_1)$-biased product
distribution, depending on whether $d$ is even or odd) into the uniform-distribution inapproximability
of $\mathsf{Sipser}_d$ by $C$.

Overview of proof. We will actually derive Proposition 8.1 as a consequence of a stronger claim,
which, roughly speaking, states that we can generate a uniformly random input $X \leftarrow \{0_{1/2}, 1_{1/2}\}^n$
via $\Psi$ and $Y$ in a stage-wise manner. In more detail, given $Y$ we consider
the following random $\{0,1,*\}$-valued labeling $\ell$ of the leaves and non-root nodes of the depth-$d$
depth-regular tree corresponding to the depth-$d$ formula computing $\mathsf{Sipser}_d$:

- The $|A_d| = n$ leaves of the tree are each labeled $\{0,1,*\}$ according to $\rho^{(d)}$.

- For $2 \le k \le d-1$, the $|A_k|$ nodes at depth $k$ are each labeled $\{0,1,*\}$ according to $\rho^{(k)}$.

- Finally, for each $i \in [w_0] = [|A_1|]$, if $\widehat{\rho^{(2)}}_i = *$ then the $i$-th node at depth 1 is labeled
$Y_i \in \{0,1\}$, and otherwise it is labeled $\widehat{\rho^{(2)}}_i \in \{0,1\}$. (The root of the tree is left unlabeled.)

Next, we let the $\{0,1\}$-valued labels of $\ell$ “percolate down the tree” as follows: every node or
leaf that is labeled $*$ by $\ell$ inherits the ($\{0,1\}$-valued) label from its closest ancestor that is not
labeled $*$. Note that this “percolation step” ensures that every leaf and non-root node of the tree
is labeled either 0 or 1, since every depth-1 node is assigned a $\{0,1\}$-valued label by $\ell$.

Let $\ell^*$ denote this $\{0,1\}$-valued random labeling of the leaves and non-root nodes. Our main
result in this section, Proposition 8.4, can be viewed as stating that the random string $X \in \{0,1\}^n$
defined by $\ell^*$’s labeling of the $n$ leaves is distributed uniformly at random; Proposition 8.1 follows
as a straightforward consequence of this claim along with our definition of projections.
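The percolation step can be illustrated with a small sketch. It assumes the tree is stored as a hypothetical `parent` map and that labels are the strings `'0'`, `'1'`, `'*'`; the names `percolate`, `parent`, and `labels` are illustrative and do not come from the paper, and the toy labels below are arbitrary rather than sampled from the distributions defined in Section 7.

```python
def percolate(parent, labels):
    """Resolve every '*' label to the label of the closest non-'*' ancestor.

    parent: dict mapping each labeled node to its parent; depth-1 nodes map to
            None because the root carries no label.
    labels: dict mapping each labeled node to '0', '1', or '*'. Depth-1 nodes
            must carry '0' or '1', which guarantees that the walk terminates.
    """
    resolved = {}

    def resolve(node):
        if node not in resolved:
            if labels[node] != '*':
                resolved[node] = labels[node]
            else:
                # Inherit from the parent, which is resolved recursively.
                resolved[node] = resolve(parent[node])
        return resolved[node]

    for node in labels:
        resolve(node)
    return resolved


# Toy depth-3 tree: two depth-1 nodes, each with two children, each with two leaves.
parent = {"u0": None, "u1": None}
labels = {"u0": "1", "u1": "0"}
for u in ("u0", "u1"):
    for j in range(2):
        v = f"{u}.{j}"
        parent[v] = u
        for i in range(2):
            parent[f"{v}.{i}"] = v
labels.update({"u0.0": "*", "u0.1": "0", "u1.0": "*", "u1.1": "1"})
labels.update({"u0.0.0": "*", "u0.0.1": "1", "u0.1.0": "*", "u0.1.1": "0",
               "u1.0.0": "*", "u1.0.1": "0", "u1.1.0": "1", "u1.1.1": "*"})

resolved = percolate(parent, labels)
leaves = sorted(node for node in labels if node.count(".") == 2)
x = "".join(resolved[node] for node in leaves)   # the string read off the leaves
```

The string `x` plays the role of $X$: each leaf either keeps its own 0/1 label or inherits one from the nearest labeled ancestor.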

We begin with the following lemma, which explains our choice of $t_{d-1}$ in (8) in the definition of
$\mathcal{R}_{\mathrm{init}}$ (Definition 6). (Note that in the lemma each coordinate of $Y$ is distributed as $\{0_{1-t_{d-1}}, 1_{t_{d-1}}\}$
regardless of whether $d$ is even or odd; this is because of our convention that the bottom-layer gates
of $\mathsf{Sipser}_d$ are always AND gates.)

Lemma 8.2. Let $\rho \leftarrow \mathcal{R}_{\mathrm{init}}$ and $Y \leftarrow \{0_{1-t_{d-1}}, 1_{t_{d-1}}\}^{(\hat{\rho})^{-1}(*)}$, and consider the string
$X \in \{0,1\}^n = \{0,1\}^{A_{d-1} \times [m]}$ defined as follows:

$$X_{a,i} = \begin{cases} Y_a & \text{if } \rho_{a,i} = * \\ \rho_{a,i} & \text{otherwise} \end{cases} \qquad \text{for all } a \in A_{d-1} \text{ and } i \in [m].$$

The string $X$ is distributed according to the uniform distribution $\{0_{1/2}, 1_{1/2}\}^n$. (Recalling Remark 12
we have that $\rho_{a,i} = *$ if and only if $\hat{\rho}_a = *$, and so $Y_a$ in the equation above is indeed well-defined.)

Proof. Since the blocks of $\rho$ are independent across $a \in A_{d-1}$ and the coordinates of $Y$ are
independent across $a \in (\hat{\rho})^{-1}(*) \subseteq A_{d-1}$, it suffices to prove that $X_a$ is distributed according to
$\{0_{1/2}, 1_{1/2}\}^m$ for a fixed $a \in A_{d-1}$. We first observe that

$$\Pr[X_a = 1^m] = \lambda + q \Pr[Y_a = 1] = \lambda + q t_{d-1} = p = 2^{-m},$$

where the $\lambda$ is from the first line of (11), the $q \Pr[Y_a = 1]$ is from the second line of (11), and the
penultimate equality is by our choice of $t_{d-1}$ in (8). Next, for any string $Z \in \{0,1\}^m \setminus \{1\}^m$, we
have that

$$\Pr[X_a = Z] = (1 - \lambda - q) \cdot \frac{2^{-m}}{1 - 2^{-m}} + q \Pr[Y_a = 0] \cdot \frac{2^{-m}}{1 - 2^{-m}} \qquad (16)$$

$$= \frac{p}{1-p} \cdot \big((1 - \lambda - q) + q(1 - t_{d-1})\big)$$

$$= \frac{p}{1-p} \cdot (1 - \lambda - q t_{d-1}) = \frac{p}{1-p} \cdot (1 - p) = p = 2^{-m}, \qquad (17)$$

where the first summand on the RHS of (16) is by the third line of (11), the second summand is
by the second line of (11), and (17) again uses our choice of $t_{d-1}$ in (8). Since this is exactly the
probability mass function of the uniform distribution $\{0_{1/2}, 1_{1/2}\}^m$, the proof is complete. □
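The calculation above can be checked exactly for small $m$. The sketch below assumes the three cases of (11) read as the proof uses them: with probability $\lambda$ the block is $1^m$, with probability $q$ it is uniform over $\{*,1\}^m \setminus \{1\}^m$, and otherwise it is uniform over $\{0,1\}^m \setminus \{1\}^m$. The function name and the toy value of $\lambda$ are illustrative assumptions, not the paper's parameter settings.

```python
from fractions import Fraction
from itertools import product

def block_distribution(m, lam, q):
    """Exact distribution of one block X_a in {0,1}^m under the assumed reading of (11)."""
    p = Fraction(1, 2**m)
    t = (p - lam) / q                       # bias of Y_a, so that lam + q*t = p
    strings = list(product((0, 1), repeat=m))
    ones = (1,) * m
    non_ones = [z for z in strings if z != ones]
    dist = {z: Fraction(0) for z in strings}

    # Case 1 (probability lam): the block is all ones.
    dist[ones] += lam
    # Case 2 (probability q): uniform over {*,1}^m minus 1^m; stars are filled by Y_a.
    for pattern in non_ones:                # a 0 in `pattern` marks a star position
        weight = q / len(non_ones)
        dist[ones] += weight * t            # Y_a = 1 turns every star into 1
        dist[pattern] += weight * (1 - t)   # Y_a = 0 turns every star into 0
    # Case 3 (probability 1 - lam - q): uniform over {0,1}^m minus 1^m.
    for z in non_ones:
        dist[z] += (1 - lam - q) / len(non_ones)
    return dist

# m = 2 gives p = 1/4 and q = sqrt(p) = 1/2; lam = 1/16 is a hypothetical toy value.
dist = block_distribution(m=2, lam=Fraction(1, 16), q=Fraction(1, 2))
```

Every entry of `dist` comes out to exactly $2^{-m}$, and the same holds for any $\lambda$ for which $t_{d-1} = (p - \lambda)/q$ is a valid probability.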


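The following lemma, the analogue of Lemma 8.2 for $\mathcal{R}(\tau)$, explains our choice of $q_a$ in terms
of $t_k$ and $t_{k-1}$ in (13):

Lemma 8.3. For $2 \le k \le d-1$ let $\tau \in \{0,1,*\}^{A_k}$, $\rho \leftarrow \mathcal{R}(\tau)$, and

$$Y \leftarrow \begin{cases} \{0_{1-t_{k-1}}, 1_{t_{k-1}}\}^{(\hat{\rho})^{-1}(*)} & \text{if } d-k \equiv 0 \bmod 2 \\ \{0_{t_{k-1}}, 1_{1-t_{k-1}}\}^{(\hat{\rho})^{-1}(*)} & \text{if } d-k \equiv 1 \bmod 2. \end{cases}$$

For each $a \in A_{k-1}$, writing $S_a = S_a(\tau)$ to denote $\tau_a^{-1}(*) = \{i \in [w_{k-1}] : \tau_{a,i} = *\}$ and $\rho(S_a)$ to
denote the substring of $\rho_a$ with coordinates in $S_a$, we consider the string $Z_a \in \{0,1\}^{S_a}$ defined as
follows:

$$Z_{a,i} = \begin{cases} Y_a & \text{if } \rho_{a,i} = * \\ \rho_{a,i} & \text{otherwise} \end{cases} \qquad \text{for all } i \in S_a.$$

The string $Z_a$ is distributed according to

$$\begin{cases} \{0_{t_k}, 1_{1-t_k}\}^{S_a} & \text{if } d-k \equiv 0 \bmod 2 \\ \{0_{1-t_k}, 1_{t_k}\}^{S_a} & \text{if } d-k \equiv 1 \bmod 2, \end{cases}$$

and furthermore, $Z_a$ and $Z_{a'}$ are independent for any two distinct $a, a' \in A_{k-1}$. (Again, recalling
Remark 12 we have that $\rho_{a,i} = *$ if and only if $\hat{\rho}_a = *$, and so $Y_a$ in the equation above is indeed
well-defined.)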

Proof. We prove the $d-k \equiv 0 \bmod 2$ case (the other case follows by a symmetric argument). If
$\tau_a$ falls in the first case of Definition 9 (i.e. if $\hat{\tau}_a = 0$ or if $S_a$ is not $k$-acceptable) then the claim
is true since $Z_a = \rho(S_a) \leftarrow \{0_{t_k}, 1_{1-t_k}\}^{S_a}$. Otherwise, if $\tau_a$ falls in the second case of Definition 9
(i.e. if $\hat{\tau}_a = *$ and $S_a$ is $k$-acceptable) we first observe that

$$\Pr[Z_a = 1^{S_a}] = \lambda + q_a \Pr[Y_a = 1] = \lambda + q_a t_{k-1} = (1 - t_k)^{|S_a|},$$

where as before the $\lambda$ is from the first line of (12), the $q_a \Pr[Y_a = 1]$ is from the second line of (12),
and the final equality is by our definition of $q_a$ in (13). Next, for any string $Z \in \{0,1\}^{S_a} \setminus \{1\}^{S_a}$
and $u := |Z^{-1}(0)| \in \{1, \ldots, |S_a|\}$, we have that

$$\Pr[Z_a = Z] = (1 - \lambda - q_a) \cdot \frac{t_k^u (1-t_k)^{|S_a|-u}}{1 - (1-t_k)^{|S_a|}} + q_a \Pr[Y_a = 0] \cdot \frac{t_k^u (1-t_k)^{|S_a|-u}}{1 - (1-t_k)^{|S_a|}} \qquad (18)$$

$$= \frac{t_k^u (1-t_k)^{|S_a|-u}}{1 - (1-t_k)^{|S_a|}} \cdot (1 - \lambda - q_a t_{k-1})$$

$$= \frac{t_k^u (1-t_k)^{|S_a|-u}}{1 - (1-t_k)^{|S_a|}} \cdot \big(1 - (1-t_k)^{|S_a|}\big) = t_k^u (1-t_k)^{|S_a|-u}, \qquad (19)$$

where as before the first summand on the RHS of (18) is by the third line of (12), the second
summand is by the second line of (12), and (19) again uses our definition of $q_a$. Therefore indeed,
the resulting string is distributed according to $\{0_{t_k}, 1_{1-t_k}\}^{S_a}$. Finally, since the blocks of $\rho$ are
independent across $a \in A_{k-1}$ and the coordinates of $Y$ are independent across $a \in (\hat{\rho})^{-1}(*) \subseteq A_{k-1}$,
we have that $Z_a$ and $Z_{a'}$ are independent for any two distinct $a, a' \in A_{k-1}$. □
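As with Lemma 8.2, this calculation can be verified exactly on a toy instance. The sketch assumes that the second and third lines of (12) draw $\rho(S_a)$ from the $t_k$-biased product distribution conditioned on not being all ones (over $\{*,1\}^{S_a}$ and $\{0,1\}^{S_a}$ respectively), which is how the proof uses them, and that $q_a = ((1-t_k)^{|S_a|} - \lambda)/t_{k-1}$. The function name and all numeric values are hypothetical.

```python
from fractions import Fraction
from itertools import product

def z_distribution(size, t_k, t_km1, lam):
    """Exact distribution of Z_a in {0,1}^S with |S| = size, for d - k even."""
    q_a = ((1 - t_k) ** size - lam) / t_km1        # assumed form of (13)
    strings = list(product((0, 1), repeat=size))
    ones = (1,) * size
    norm = 1 - (1 - t_k) ** size                   # product mass of non-all-ones strings

    def cond_weight(z):
        # t_k^{#zeros} (1 - t_k)^{#ones}, conditioned on z != all ones.
        u = z.count(0)
        return t_k ** u * (1 - t_k) ** (size - u) / norm

    dist = {z: Fraction(0) for z in strings}
    dist[ones] += lam                              # first line of (12)
    for z in strings:
        if z == ones:
            continue
        # Second line of (12): zeros of z mark stars, filled by Y_a (1 w.p. t_{k-1}).
        dist[ones] += q_a * cond_weight(z) * t_km1
        dist[z] += q_a * cond_weight(z) * (1 - t_km1)
        # Third line of (12): rho(S_a) is itself a non-all-ones 0/1 string.
        dist[z] += (1 - lam - q_a) * cond_weight(z)
    return dist, q_a

t_k = Fraction(1, 4)
dist, q_a = z_distribution(size=3, t_k=t_k, t_km1=Fraction(1, 2), lam=Fraction(1, 64))
```

Each string with $u$ zeros receives mass exactly $t_k^u (1-t_k)^{3-u}$, matching the product distribution claimed in the lemma.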


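Together Lemmas 8.2 and 8.3 give us the following proposition, which in turn yields Proposition 8.1,
our main result in this section.

Proposition 8.4. Let $\rho^{(d)} \leftarrow \mathcal{R}_{\mathrm{init}}$ and $\rho^{(k)} \leftarrow \mathcal{R}(\widehat{\rho^{(k+1)}})$ for $2 \le k \le d-1$. Let

$$Y^{(1)} \leftarrow \begin{cases} \{0_{1-t_1}, 1_{t_1}\}^{(\widehat{\rho^{(2)}})^{-1}(*)} & \text{if } d \text{ is even} \\ \{0_{t_1}, 1_{1-t_1}\}^{(\widehat{\rho^{(2)}})^{-1}(*)} & \text{if } d \text{ is odd,} \end{cases}$$

and for $2 \le k \le d-1$ consider random strings $Y^{(k)} \in \{0,1\}^{(\widehat{\rho^{(k+1)}})^{-1}(*)}$ defined inductively from
$k = 2$ up to $d-1$ as follows:

$$Y^{(k)}_{a,i} = \begin{cases} Y^{(k-1)}_a & \text{if } \rho^{(k)}_{a,i} = * \\ \rho^{(k)}_{a,i} & \text{otherwise} \end{cases} \qquad \text{for all } a \in A_{k-1} \text{ and } i \in [w_{k-1}] \text{ s.t. } \widehat{\rho^{(k+1)}}_{a,i} = *. \qquad (20)$$

Then the string $X \in \{0,1\}^n = \{0,1\}^{A_{d-1} \times [m]}$ defined by

$$X_{a,i} = \begin{cases} Y^{(d-1)}_a & \text{if } \rho^{(d)}_{a,i} = * \\ \rho^{(d)}_{a,i} & \text{otherwise} \end{cases} \qquad \text{for all } a \in A_{d-1} \text{ and } i \in [m]$$

is distributed according to the uniform distribution $\{0_{1/2}, 1_{1/2}\}^n$.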

Proof. By the $k = 2$ case of Lemma 8.3, for all possible outcomes of $\rho^{(k)}$ for $3 \le k \le d$,
conditioned on such an outcome the random string $Y^{(2)}$ is distributed according to $\{0_{t_2}, 1_{1-t_2}\}^{(\widehat{\rho^{(3)}})^{-1}(*)}$
if $d$ is even and according to $\{0_{1-t_2}, 1_{t_2}\}^{(\widehat{\rho^{(3)}})^{-1}(*)}$ if $d$ is odd. Applying this argument repeatedly and
arguing inductively from $k = 2$ up to $k = d-1$, we have that conditioned on any outcome of
$\rho^{(d)} \leftarrow \mathcal{R}_{\mathrm{init}}$, the random string $Y^{(d-1)}$ is distributed according to $\{0_{1-t_{d-1}}, 1_{t_{d-1}}\}^{(\widehat{\rho^{(d)}})^{-1}(*)}$. The claim
then follows by Lemma 8.2. □

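Proof of Proposition 8.1. Recall that $X \leftarrow \{0_{1/2}, 1_{1/2}\}^n$ and $Y \leftarrow \{0_{1-t_1}, 1_{t_1}\}^{w_0}$ if $d$ is even, $Y \leftarrow \{0_{t_1}, 1_{1-t_1}\}^{w_0}$
if $d$ is odd. Let $\rho^{(d)} \leftarrow \mathcal{R}_{\mathrm{init}}$ and $\rho^{(k)} \leftarrow \mathcal{R}(\widehat{\rho^{(k+1)}})$ for $2 \le k \le d-1$. For $1 \le k \le d-1$
let $Y^{(k)} \in \{0,1\}^{(\widehat{\rho^{(k+1)}})^{-1}(*)}$ be defined as in Proposition 8.4. Recalling Remark 12, for all functions
$h : \{0,1\}^n \to \{0,1\}$ and $1 \le k \le d-1$, the random projection

$$(\mathrm{proj}_{\rho^{(k+1)}} \cdots \mathrm{proj}_{\rho^{(d)}}\, h) : \{0,1\}^{A_k} \to \{0,1\}$$

depends only on the coordinates in $(\widehat{\rho^{(k+1)}})^{-1}(*) \subseteq A_k$, and so we may equivalently view it as
a function $\{0,1\}^{(\widehat{\rho^{(k+1)}})^{-1}(*)} \to \{0,1\}$. By Proposition 8.4, the definition of the $Y^{(k)}$’s, and the
definition of projections, we see that

$$\Pr[f(X) \ne g(X)] = \Pr[(\mathrm{proj}_{\rho^{(d)}} f)(Y^{(d-1)}) \ne (\mathrm{proj}_{\rho^{(d)}} g)(Y^{(d-1)})]$$

$$= \Pr[(\mathrm{proj}_{\rho^{(d-1)}} \mathrm{proj}_{\rho^{(d)}} f)(Y^{(d-2)}) \ne (\mathrm{proj}_{\rho^{(d-1)}} \mathrm{proj}_{\rho^{(d)}} g)(Y^{(d-2)})]$$

$$= \Pr[(\mathrm{proj}_{\rho^{(2)}} \cdots \mathrm{proj}_{\rho^{(d)}} f)(Y^{(1)}) \ne (\mathrm{proj}_{\rho^{(2)}} \cdots \mathrm{proj}_{\rho^{(d)}} g)(Y^{(1)})]$$

$$= \Pr[(\mathrm{proj}_{\rho^{(2)}} \cdots \mathrm{proj}_{\rho^{(d)}} f)(Y) \ne (\mathrm{proj}_{\rho^{(2)}} \cdots \mathrm{proj}_{\rho^{(d)}} g)(Y)]$$

$$= \Pr[(\Psi(f))(Y) \ne (\Psi(g))(Y)],$$

where the final equality is by the definition of $\Psi$ (Definition 10). □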
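For intuition, the single-stage case ($d = 2$, one projection drawn from $\mathcal{R}_{\mathrm{init}}$) can be checked by exhaustive enumeration. The sketch below assumes that $(\mathrm{proj}_\rho h)(y)$ evaluates $h$ at the string $x$ with $x_{a,i} = y_a$ where $\rho_{a,i} = *$ and $x_{a,i} = \rho_{a,i}$ otherwise, and that (11) reads as in the proof of Lemma 8.2. The constants, helper names, and the two test functions are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

M = 2                          # block size m (toy value)
BLOCKS = 2                     # number of blocks (toy value)
P = Fraction(1, 2**M)
Q = Fraction(1, 2)             # q = sqrt(p) when m = 2
LAM = Fraction(1, 16)          # hypothetical lambda, not the paper's choice
T = (P - LAM) / Q              # bias of each Y_a

def block_outcomes():
    """All outcomes of one block of rho with their probabilities."""
    non_ones = [z for z in product((0, 1), repeat=M) if 0 in z]
    outcomes = [((1,) * M, LAM)]
    for z in non_ones:
        stars = tuple('*' if bit == 0 else 1 for bit in z)
        outcomes.append((stars, Q / len(non_ones)))
        outcomes.append((z, (1 - LAM - Q) / len(non_ones)))
    return outcomes

def project(h, rho):
    """Return y -> h(x), where stars in block a are replaced by y[a]."""
    def projected(y):
        x = tuple(tuple(y[a] if c == '*' else c for c in rho[a])
                  for a in range(BLOCKS))
        return h(x)
    return projected

def disagreement_after_projection(f, g):
    total = Fraction(0)
    for blocks in product(block_outcomes(), repeat=BLOCKS):
        rho = tuple(block for block, _ in blocks)
        pr_rho = Fraction(1)
        for _, pr in blocks:
            pr_rho *= pr
        pf, pg = project(f, rho), project(g, rho)
        for y in product((0, 1), repeat=BLOCKS):
            pr_y = Fraction(1)
            for bit in y:
                pr_y *= T if bit == 1 else 1 - T
            if pf(y) != pg(y):
                total += pr_rho * pr_y
    return total

def disagreement_uniform(f, g):
    block_values = list(product((0, 1), repeat=M))
    inputs = list(product(block_values, repeat=BLOCKS))
    return Fraction(sum(f(x) != g(x) for x in inputs), len(inputs))

f = lambda x: int(any(all(block) for block in x))   # OR of ANDs
g = lambda x: x[0][0] ^ x[1][1]                      # arbitrary comparison function
```

Under these assumptions `disagreement_after_projection(f, g)` and `disagreement_uniform(f, g)` agree exactly, for any choice of `f` and `g`.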

9 Approximator simplifies under random projections 

With Proposition 8.1 in hand we next prove that the approximating circuit $C$ of the type specified
in either Theorem 6 or Theorem 7 “collapses to a simple function” with high probability under a $\Psi$-random
projection. For the case that the depth-$d$ circuit $C$ has significantly smaller bottom fan-in than
$\mathsf{Sipser}_d$ we show that $C$ collapses to a shallow decision tree with high probability, and for the case
that $C$ has the opposite alternation pattern to $\mathsf{Sipser}_d$ we show that $C$ collapses to a small-width
depth-two circuit with top gate opposite to that of $\Psi(\mathsf{Sipser}_d)$ with high probability.

We do so via a projection switching lemma, showing that each of the $d-1$ individual random
projections $\mathrm{proj}_{\rho^{(k)}}$ comprising $\Psi$ “contributes to the simplification” of $C$ with high probability. We
state and prove our projection switching lemma in Sections 9.1 through 9.5, and in Section 9.6 we
show how the lemma can be applied iteratively to prove our structural claims about $\Psi(C)$.

9.1 The projection switching lemma and its proof

Proposition 9.1 (Projection switching lemma for $\mathcal{R}_{\mathrm{init}}$). Let $F : \{0,1\}^n \to \{0,1\}$ be a depth-2
circuit with bottom fan-in $r$. Then for all $s \ge 1$,

$$\Pr_{\rho \leftarrow \mathcal{R}_{\mathrm{init}}}[\mathrm{proj}_\rho F \text{ is not a depth-}s \text{ decision tree}] = \big(O(r 2^r w^{-1/4})\big)^s.$$

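Proposition 9.2 (Projection switching lemma for $\mathcal{R}(\tau)$). Let $2 \le k \le d-1$ and $F : \{0,1\}^{A_k} \to \{0,1\}$
be a depth-2 circuit with bottom fan-in $r$. Then for all $\tau \in \{0,1,*\}^{A_k}$ and $s \ge 1$,

$$\Pr_{\rho \leftarrow \mathcal{R}(\tau)}[\mathrm{proj}_\rho F \text{ is not a depth-}s \text{ decision tree}] = \big(O(r e^{r t_k/(1-t_k)} w^{-1/4})\big)^s.$$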

The proofs of Propositions 9.1 and 9.2 have the same overall structure, and they share many of 
the same ingredients. We will only prove (the slightly more involved) Proposition 9.2, and at the 
end of this section we point out the essential differences in the proof of Proposition 9.1. 

Furthermore, we will prove Proposition 9.2 assuming that $F$ is a DNF and $d-k \equiv 0 \bmod 2$.
Both assumptions are without loss of generality. (For the first, we recall that $F$ is a width-$r$
DNF if and only if its Boolean dual $F^\dagger$ is a width-$r$ CNF, and that a Boolean function is
computed by a depth-$s$ decision tree if and only if its Boolean dual is as well, and we observe that
$(\mathrm{proj}_\rho F)^\dagger = \mathrm{proj}_\rho (F^\dagger)$ for all $\rho$ and all $F$. For the second we note that the definition of $\mathcal{R}(\tau)$ when
$d-k \equiv 0 \bmod 2$ is dual to that of $\mathcal{R}(\tau)$ when $d-k \equiv 1 \bmod 2$, and so applying the former to
$F(x)$ is equivalent to applying the latter to $F(\bar{x})$.)


Overview of proof. At a high level, we adopt Razborov’s strategy in his alternative proof [Raz95]
of Håstad’s Switching Lemma. We briefly recall the overall structure of Razborov’s argument. Given
a DNF $F : \{0,1\}^n \to \{0,1\}$ and a distribution $\mathcal{R}$ over restrictions in $\{0,1,*\}^n$, we let $\mathcal{B} \subseteq \{0,1,*\}^n$
denote the set of all bad restrictions, namely the ones such that $F \upharpoonright \rho$ is not computed by a small-depth
decision tree. Our goal in a switching lemma is to bound $\Pr_{\boldsymbol{\rho} \leftarrow \mathcal{R}}[\boldsymbol{\rho} \in \mathcal{B}]$, the weight of $\mathcal{B}$
under $\mathcal{R}$. To do so, we define an encoding of each bad restriction $\rho \in \mathcal{B}$ as a different restriction
$\rho' \in \{0,1,*\}^n$ and a small amount (say at most $\ell$ bits) of “auxiliary information”:

$$\mathrm{encode} : \mathcal{B} \to \{0,1,*\}^n \times \{0,1\}^\ell$$
$$\mathrm{encode}(\rho) = (\rho', \text{auxiliary information}).$$

This encoding should satisfy two key properties. First, it should be uniquely decodable, meaning
that one is always able to recover $\rho$ given $\rho'$ and the auxiliary information; equivalently, the function
$\mathrm{encode}(\cdot)$ is an injection. Second, the weight $\Pr_{\boldsymbol{\rho} \leftarrow \mathcal{R}}[\boldsymbol{\rho} = \rho']$ of $\rho'$ under $\mathcal{R}$ should be larger than
that of $\rho$ by a significant multiplicative factor (say by a factor of $\Gamma$). It is not hard to see that
together, these two properties imply that the total weight of all bad restrictions with the same auxiliary
information is at most $1/\Gamma$. To complete the proof of the switching lemma, we then bound the
overall weight of $\mathcal{B}$ via a union bound over all $2^\ell$ possible strings of auxiliary information. (For a
detailed exposition of Razborov’s proof technique see [Bea94, Tha09] and Chapter 14 of [AB09].)
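The counting step in this strategy can be written out in one line. In the display below, $\Gamma$ denotes the multiplicative weight gain, $\ell$ the number of auxiliary bits, and $\mathcal{B}_\alpha$ is shorthand (introduced only for this display) for the set of bad restrictions whose encoding carries the auxiliary string $\alpha$; $\varrho'$ denotes the first coordinate of $\mathrm{encode}(\varrho)$.

```latex
\Pr_{\boldsymbol{\rho} \leftarrow \mathcal{R}}[\boldsymbol{\rho} \in \mathcal{B}]
  = \sum_{\alpha \in \{0,1\}^{\ell}} \sum_{\varrho \in \mathcal{B}_\alpha}
      \Pr[\boldsymbol{\rho} = \varrho]
  \le \sum_{\alpha \in \{0,1\}^{\ell}} \frac{1}{\Gamma}
      \sum_{\varrho \in \mathcal{B}_\alpha} \Pr[\boldsymbol{\rho} = \varrho']
  \le \frac{2^{\ell}}{\Gamma}.
```

The first inequality is the weight-gain property, and the second uses injectivity: distinct $\varrho \in \mathcal{B}_\alpha$ have distinct images $\varrho'$, so each inner sum is at most 1.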

The proof of our projection switching lemma follows this high-level strategy quite closely; specifically,
we build off of a reformulation (due to Thapen [Tha09]) of Håstad’s proof of the blockwise
variant of his Switching Lemma in Razborov’s framework. In Section 9.3 we define our encoding,
specifying the restriction $\rho'$ and auxiliary information that is associated with every bad restriction
$\rho$; in Section 9.4 we prove that our encoding is an injection by describing a procedure for unique
decoding; in Section 9.5 we verify that every bad $\rho$ is indeed paired with a $\rho'$ whose weight under
$\mathcal{R}(\tau)$ is much larger, and show how this completes the proof of our projection switching lemma.

One important aspect in which we differ from Håstad’s and Razborov-Thapen’s proofs (and
indeed, this is the key distinction between our projection switching lemma and previous switching
lemmas) is that we will be concerned with the complexity of the randomly projected DNF
$\mathrm{proj}_\rho F = \mathrm{proj}\,(F \upharpoonright \rho)$, rather than the randomly restricted DNF $F \upharpoonright \rho$. Recalling our definition
of projections (Definition 4) and Remark 9 in particular, we see that the decision tree depth of
$\mathrm{proj}\,(F \upharpoonright \rho)$ can in general be significantly smaller than that of $F \upharpoonright \rho$, since groups of distinct
formal variables $\{x_{a,i} : i \in [w]\}$ of $F \upharpoonright \rho$ get mapped to the same formal variable $y_a$ under the
projection operator. As we will see, the proof of our projection switching lemma crucially exploits
this fact.


9.2 Canonical projection decision tree 


To emphasize the fact that the DNF $F$ and its random projection $\mathrm{proj}_\rho F$ are over two different
spaces of formal variables, we will let $\mathcal{X} = \{x_{a,i} : a \in A_{k-1}, i \in [w_{k-1}]\}$ denote the formal variables
of $F$, and $\mathcal{Y} = \{y_a : a \in A_{k-1}\}$ denote the formal variables of $\mathrm{proj}_\rho F$. For notational clarity, from
this section through Section 9.4 we omit the subscripts on $A_{k-1}$ and $w_{k-1}$ and simply write $A$
and $w$.


Definition 11. Let $G : \{0,1\}^{A \times [w]} \to \{0,1\}$ be a DNF over $\mathcal{X}$ and $T$ be a term in $G$. We say
that a variable $x_{a,i}$ occurs positively in $T$ if $T$ contains the unnegated literal $x_{a,i}$, and that it occurs
negatively in $T$ if $T$ contains the negated literal $\overline{x_{a,i}}$. We say that $x_{a,i}$ occurs in $T$ if it either occurs
positively or negatively in $T$.

Definition 12. For any $\eta \subseteq \mathcal{Y}$ and assignment $\pi \in \{0,1\}^\eta$, the restriction $(\eta \mapsto \pi) \in \{0,1,*\}^{A \times [w]}$
to the variables in $\mathcal{X}$ is defined as follows: for all $a \in A$ and $i \in [w]$,

$$(\eta \mapsto \pi)_{a,i} = \begin{cases} \pi(y_a) & \text{if } y_a \in \eta \\ * & \text{otherwise.} \end{cases}$$

We stress that for a given $a$, the value of $(\eta \mapsto \pi)_{a,i}$ is independent of the value of $i \in [w]$.

Next, we define a procedure which, given any DNF $G$ over $\mathcal{X}$, returns a “canonical” decision
tree $\mathrm{ProjDT}(G)$ over $\mathcal{Y}$ computing its projection $\mathrm{proj}\, G$. The proof of our switching lemma will
establish that the depth of $\mathrm{ProjDT}(F \upharpoonright \rho)$ is small with high probability; this clearly implies that
the decision tree depth of $\mathrm{proj}_\rho F = \mathrm{proj}\,(F \upharpoonright \rho)$ is small with high probability. (We remark that
both Håstad’s and Razborov’s proofs of Håstad’s Switching Lemma consider an analogous notion of
a canonical decision tree whose depth they bound; in their context, however, the canonical decision
tree computes the DNF itself, whereas the canonical decision tree we now define computes the
projection of the DNF.)

Definition 13 (Canonical projection decision tree). Let $G : \{0,1\}^{A \times [w]} \to \{0,1\}$ be a DNF over
$\mathcal{X}$, where we assume a fixed but arbitrary ordering on its terms, and likewise on the literals within
each term. The canonical projection decision tree $\mathrm{ProjDT}(G) : \{0,1\}^{A} \to \{0,1\}$ associated with $G$
is defined recursively as follows:

1. If $G \equiv 1$ (i.e. if $G(X) = 1$ for all $X \in \{0,1\}^{A \times [w]}$) output the trivial decision tree $\mathrm{ProjDT}(G) = 1$,
and likewise, if $G \equiv 0$ output $\mathrm{ProjDT}(G) = 0$.

2. Otherwise, let $T$ be the first term in $G$ such that $T \not\equiv 0$, and let

$$\eta = \{y_a : x_{a,i} \text{ occurs in } T \text{ for some } i \in [w]\}.$$

3. $\mathrm{ProjDT}(G)$ queries all the variables in $\eta$ in its first $|\eta|$ levels.

4. For each path $\pi \in \{0,1\}^\eta$, recurse on $G \upharpoonright (\eta \mapsto \pi)$.

We stress that while $G$ is a DNF over the variables in $\mathcal{X}$, the canonical projection decision
tree $\mathrm{ProjDT}(G)$ queries variables in $\mathcal{Y}$. The following fact is a straightforward consequence of
Definition 13:

Fact 9.3. $\mathrm{ProjDT}(G)$ computes $\mathrm{proj}\, G$.
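The recursive construction can be sketched in code. This is a simplified illustration under stated assumptions: a DNF is a list of terms, each term a dict from `(a, i)` to a sign; the projection substitutes $x_{a,i} = y_a$ for every $i$; and the constant checks in step 1 are done syntactically (no live terms, or an empty term), so the tree may be deeper than the one in Definition 13 when $G$ is constant for non-syntactic reasons. All function names are hypothetical.

```python
from itertools import product

# A DNF over the variables x_{a,i} is a list of terms. Each term is a dict
# mapping (a, i) -> True (x_{a,i} occurs positively) or False (negatively).

def restrict(dnf, pi):
    """Apply (eta -> pi): every x_{a,i} with a in pi is set to pi[a]."""
    remaining = []
    for term in dnf:
        new_term = {}
        for (a, i), sign in term.items():
            if a not in pi:
                new_term[(a, i)] = sign      # variable stays free
            elif pi[a] != sign:
                break                        # literal falsified: drop the term
        else:
            remaining.append(new_term)       # satisfied literals are removed
    return remaining

def proj_dt(dnf):
    """Return a leaf 0/1, or (eta, children) with children keyed by bit tuples."""
    if not dnf:
        return 0                             # every term has been falsified
    if any(len(term) == 0 for term in dnf):
        return 1                             # some term is fully satisfied
    eta = sorted({a for (a, _i) in dnf[0]})  # blocks touched by the first live term
    children = {}
    for bits in product((0, 1), repeat=len(eta)):
        pi = {a: bool(b) for a, b in zip(eta, bits)}
        children[bits] = proj_dt(restrict(dnf, pi))
    return (eta, children)

def evaluate(tree, y):
    while isinstance(tree, tuple):
        eta, children = tree
        tree = children[tuple(int(y[a]) for a in eta)]
    return tree

def depth(tree):
    if not isinstance(tree, tuple):
        return 0
    eta, children = tree
    return len(eta) + max(depth(child) for child in children.values())

def proj_eval(dnf, y):
    """Evaluate proj G at y by substituting x_{a,i} = y_a for every i."""
    return int(any(all(bool(y[a]) == sign for (a, _i), sign in term.items())
                   for term in dnf))

# Toy DNF: (x_{a,1} AND x_{a,2} AND NOT x_{b,1}) OR (x_{b,2} AND x_{c,1}).
G = [{("a", 1): True, ("a", 2): True, ("b", 1): False},
     {("b", 2): True, ("c", 1): True}]
tree = proj_dt(G)
```

Although `G` mentions five $x$-variables, the tree queries only the three $y$-variables, which is the effect the switching lemma exploits.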

9.3 Encoding bad restrictions 

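Fix $\tau \in \{0,1,*\}^{A_k}$, and consider

$$\mathcal{B} = \{\rho \in \{0,1,*\}^{A_k} : \rho \text{ refines } \tau \text{ and } \mathrm{proj}_\rho F \text{ is not a depth-}s \text{ decision tree}\}.$$

We call these restrictions $\rho \in \mathcal{B}$ bad, and recall that our goal is to bound $\Pr[\boldsymbol{\rho} \in \mathcal{B}]$ for $\boldsymbol{\rho} \leftarrow \mathcal{R}(\tau)$.
Fix a bad restriction $\rho \in \mathcal{B}$. It will be convenient for us to adopt the equivalent view of $\mathrm{proj}_\rho F$
as $\mathrm{proj}\,(F \upharpoonright \rho)$ in this section. Since $\mathrm{proj}\,(F \upharpoonright \rho)$ is not computed by a depth-$s$ decision tree over $\mathcal{Y}$, this in
particular implies that the canonical projection decision tree $\mathrm{ProjDT}(F \upharpoonright \rho)$ has depth at least $s$
(recall by Fact 9.3 that $\mathrm{ProjDT}(F \upharpoonright \rho)$ computes $\mathrm{proj}\,(F \upharpoonright \rho)$), and so we may let $\pi \in \{0,1\}^{\ge s}$ be
the leftmost root-to-leaf path of length at least $s$ in $\mathrm{ProjDT}(F \upharpoonright \rho)$.

We now define a few objects associated with $\rho$ and $\pi$: for some $1 \le j \le s$, we define

- A collection of terms $T_1, \ldots, T_j$ in $F$.

- Disjoint sets of variables $\eta_1, \ldots, \eta_j \subseteq \mathcal{Y}$, and for each such $\eta_\ell$, a bit string $\mathrm{encode}(\eta_\ell) \in \{0,1\}^{|\eta_\ell|(\log r + 1)}$.

- A restriction $\sigma = \sigma^1 \sigma^2 \cdots \sigma^j \in \{0,1,*\}^{A \times [w]}$ such that $\sigma^{-1}(\{0,1\}) \subseteq \rho^{-1}(*)$ (i.e. $\sigma$ only sets
to constants variables left free by $\rho$).

- Disjoint sets of variables $\gamma_1, \ldots, \gamma_j \subseteq \mathcal{X}$, and for each such $\gamma_\ell$, a bit string $\mathrm{encode}(\gamma_\ell)$ of
Hamming weight $|\gamma_\ell|$ and length $r$.

- A decomposition of the length-$s$ prefix $\pi' = \pi^1 \pi^2 \cdots \pi^j \in \{0,1\}^s$ of $\pi$.

These objects are defined inductively starting from $\ell = 1$ up to $\ell = j$, where $j \in [s]$ is the smallest
integer such that the $\eta_\ell$’s as defined below satisfy $|\eta_1 \cup \cdots \cup \eta_j| \ge s$. For $\ell \in [j]$,

- $T_\ell$ is the first term in $F$ such that $T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1}) \not\equiv 0$ and

$$\eta_\ell = \{y_a : x_{a,i} \text{ occurs in } T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1}) \text{ for some } i \in [w]\} \subseteq \mathcal{Y}.$$

We define $\mathrm{encode}(\eta_\ell) \in \{0,1\}^{|\eta_\ell|(\log r + 1)}$ as follows: for each $y_a \in \eta_\ell$, we use $\log |T_\ell| \le \log r$ bits
to encode the location of $x_{a,i^\ast}$ in $T_\ell$, where

$$i^\ast := \min\{i \in [w] : x_{a,i} \text{ occurs in } T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1})\},$$

along with a single bit to indicate whether $y_a$ is the last variable in $\eta_\ell$.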
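
- Let $\sigma^\ell \in \{0,1,*\}^{A \times [w]}$ be defined as follows: for each $y_a \in \eta_\ell$ and $i \in [w]$,

$$\sigma^\ell_{a,i} = \begin{cases} 1 & \text{if } x_{a,i} \text{ occurs positively in } T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1}) \\ 0 & \text{if } x_{a,i} \text{ occurs negatively in } T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1}) \\ 0 & \text{if } \rho_{a,i} = * \text{ and } x_{a,i} \text{ does not occur in } T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1}). \end{cases}$$

(Note that if $x_{a,i}$ occurs in $T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1})$ then certainly $\rho_{a,i} = *$.) All
remaining entries of $\sigma^\ell$ not specified above have value $*$.

We make a few observations that will be useful for us later. First observe that for every $y_a \in \eta_\ell$,

(i) $(\rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1}))_a = \rho_a$

since $y_a \notin \eta_1 \cup \cdots \cup \eta_{\ell-1}$. Furthermore, writing $S_a = S_a(\tau)$ to denote $\tau_a^{-1}(*) = \{i \in [w] : \tau_{a,i} = *\}$
and $\rho(S_a)$ to denote the substring of $\rho_a$ with coordinates in $S_a$, we claim that for every $y_a \in \eta_\ell$,

(ii) $\tau_a \in \{*,1\}^w \setminus \{1\}^w$,

(iii) the set $S_a$ is $k$-acceptable,

(iv) $\rho(S_a) \in \{*,1\}^{S_a} \setminus \{1\}^{S_a}$ (and hence $\rho_a \in \{*,1\}^w \setminus \{1\}^w$ by (ii)),

(v) $(\rho\sigma^\ell)(S_a) \in \{0,1\}^{S_a}$,

(vi) $(\rho(S_a))^{-1}(1) \subseteq ((\rho\sigma^\ell)(S_a))^{-1}(1)$.

To see this, first note that since $y_a \in \eta_\ell$ it must be the case that $\rho_{a,i} = *$ for at least one $i \in S_a$,
and by inspecting (12) of Definition 9 we have that indeed (ii), (iii), and (iv) hold. Claims (v)
and (vi) follow from the fact that $\sigma^\ell$ is defined so that $\sigma^\ell_{a,i} \in \{0,1\}$ iff $\rho_{a,i} = *$. These claims
will be useful for us later in the proof of Lemma 9.7.

Second, we claim that

$$T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1}) \sigma^\ell \equiv 1. \qquad (21)$$

To see this, we note that every variable that occurs in the term $T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1})$
is fixed by $\sigma^\ell$, and furthermore, each is fixed in the unique way so as to satisfy the term. This
will be useful for us later in the proof of Proposition 9.4.

- Let
$$\gamma_\ell = \{x_{a,i} : \sigma^\ell_{a,i} = 1\} \subseteq \mathcal{X},$$
and let $\mathrm{encode}(\gamma_\ell)$ be the string $\mathrm{encode}(\gamma_\ell) \in \{0,1\}^r$ of Hamming weight $|\gamma_\ell|$ and length $r$
indicating the location of the elements of $\gamma_\ell$ within $T_\ell$.

- Let $\pi^\ell$ be the length-$|\eta_\ell|$ substring of $\pi$ from index $|\eta_1 \cup \cdots \cup \eta_{\ell-1}| + 1$ through $|\eta_1 \cup \cdots \cup \eta_\ell|$
inclusive.

After the final iteration $\ell = j$, if necessary, we trim $\eta_j$ and $\pi^j$ so that $|\eta_1 \cup \cdots \cup \eta_j| = |\pi^1 \cdots \pi^j|$
is exactly $s$, and redefine $\sigma^j$ and $\gamma_j$ appropriately. We refer the reader to Figure 1 and its caption
for a concrete example and explanation of our encoding procedure.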

[Figure 1 illustration: the term $T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1}) = \overline{x_{a,1}} \wedge x_{a,8} \wedge x_{a',4} \wedge x_{a'',2}$ drawn above the $a$-th block of $\rho$, with coordinates $\rho_{a,1}$ and $\rho_{a,8}$ marked.]

Figure 1: Let $T_\ell$ be the first term not falsified by $\rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1})$, and suppose it
evaluates to $\overline{x_{a,1}} \wedge x_{a,8} \wedge x_{a',4} \wedge x_{a'',2}$. In this example $\eta_\ell$ will be the set $\{y_a, y_{a'}, y_{a''}\} \subseteq \mathcal{Y}$. Focusing
on variables from the $a$-th block, we first recall our observation earlier that $(\rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1}))_a = \rho_a$
since $y_a \notin \eta_1 \cup \cdots \cup \eta_{\ell-1}$ (Claim (i) in Section 9.3). Furthermore, as illustrated
above, we have that $\rho_a \in \{*,1\}^w \setminus \{1\}^w$ and $\rho_a$ refines $\tau_a \in \{*,1\}^w \setminus \{1\}^w$ (Claims (ii) and (iv) of
Section 9.3).

Since $x_{a,1}$ and $x_{a,8}$ occur in $T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1})$ it certainly must be the case that
$\rho_{a,1} = \rho_{a,8} = *$; there may also be other coordinates $i \in [w]$ such that $\rho_{a,i} = *$ and $x_{a,i}$ does not
occur in $T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1})$ (coordinates 2 through 7 in our example above). For
$i \in [w]$ such that $\rho_{a,i} = *$ and $x_{a,i}$ occurs in $T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1})$, the restriction
$\sigma^\ell$ fixes $x_{a,i}$ so as to partially satisfy $T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1})$; in our example above,
$\sigma^\ell_{a,1} = 0$ (since $x_{a,1}$ occurs negatively) whereas $\sigma^\ell_{a,8} = 1$ (since $x_{a,8}$ occurs positively). The remaining
variables $x_{a,2}, \ldots, x_{a,7}$ are set to 0 by $\sigma^\ell$, yielding a completely fixed block $(\rho\sigma^\ell)_a \in \{0,1\}^w$ (Claim
(v) in Section 9.3). Intuitively, we “break symmetry” and set these variables to 0 (rather than 1)
so that the decoder will be able to “undo” them in $(\rho\sigma^\ell)_a$ without any auxiliary information: since
$\rho_a \in \{*,1\}^w \setminus \{1\}^w$, the decoder readily infers $\rho_{a,i} = *$ for all $i \in [w]$ such that $(\rho\sigma^\ell)_{a,i} = 0$. And
indeed, for the set $\gamma_\ell \subseteq \mathcal{X}$ of variables $x_{a,i}$ that are set to 1 by $\sigma^\ell$, we provide the decoder with the
auxiliary information $\mathrm{encode}(\gamma_\ell)$ so that she is able to “undo” them in $(\rho\sigma^\ell)_a$.

9.4 Decodability

Let $\eta = \eta_1 \cup \cdots \cup \eta_j$, $\mathrm{encode}(\eta) = \mathrm{encode}(\eta_1) \cdots \mathrm{encode}(\eta_j) \in \{0,1\}^{s(\log r + 1)}$, $\sigma = \sigma^1 \cdots \sigma^j$,
$\gamma = \gamma_1 \cup \cdots \cup \gamma_j$, $\mathrm{encode}(\gamma) = \mathrm{encode}(\gamma_1) \cdots \mathrm{encode}(\gamma_j) \in \{0,1\}^{rs}$, and $\pi' = \pi^1 \cdots \pi^j \in \{0,1\}^s$. Our
main result in this subsection is the following proposition:

Proposition 9.4. The map $\theta : \mathcal{B} \to \{0,1,*\}^{A \times [w]} \times \{0,1\}^s \times \{0,1\}^{s(1 + \log r)} \times \{0,1\}^{rs}$,

$$\theta(\rho) = (\rho\sigma, \pi', \mathrm{encode}(\eta), \mathrm{encode}(\gamma)),$$

is an injection.

Before proving Proposition 9.4, we state a slight extension of an observation made above in the
definition of $\sigma^\ell$:

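Lemma 9.5. For all $1 \le \ell \le j-1$ we have

$$T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1}) \sigma^\ell \sigma^{\ell+1} \cdots \sigma^j \equiv 1,$$

and when $\ell = j$ we have $T_j \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{j-1} \mapsto \pi^{j-1}) \sigma^j \not\equiv 0$.

Proof. As we observed in the definition of $\sigma^\ell$ above (cf. (21)), we have that $\sigma^\ell$ is designed so that

$$T_\ell \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1}) \sigma^\ell \equiv 1,$$

and certainly this remains true when the restriction is further extended by $\sigma^{\ell+1} \cdots \sigma^j$. We do not
necessarily have this property for $\ell = j$ due to our possible trimming of $\eta_j$ so that $\eta_1 \cup \cdots \cup \eta_j$ has
cardinality exactly $s$; this results in a redefinition of $\sigma^j$ where some of its coordinates are set from
$\{0,1\}$ back to $*$. However it is still the case that $\sigma^j$ partially satisfies $T_j \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{j-1} \mapsto \pi^{j-1})$,
and hence $T_j \upharpoonright \rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{j-1} \mapsto \pi^{j-1}) \sigma^j \not\equiv 0$. □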

Proof of Proposition 9.4. We prove the proposition by describing a procedure that allows a “decoder”
to uniquely obtain $\rho$ given $(\rho\sigma, \pi', \mathrm{encode}(\eta), \mathrm{encode}(\gamma))$. Recall that $T_1$ is defined to be the
first term in $F$ not falsified by $\rho$. By Lemma 9.5, this remains true when $\rho$ is extended by $\sigma$: that is,
the first term $T$ in $F$ such that $T \upharpoonright \rho\sigma \not\equiv 0$ is precisely $T_1$ itself. Therefore, given $\rho\sigma$ the decoder is
able to identify $T_1$ in $F$, and with $T_1$ in hand she is able to then use $\mathrm{encode}(\eta_1)$ and $\mathrm{encode}(\gamma_1)$ to
recover $\eta_1$ and $\gamma_1$ respectively. Next, she “undoes” $\sigma^1$ in $\rho\sigma = \rho\sigma^1\sigma^2 \cdots \sigma^j$ and obtains $\rho\sigma^2 \cdots \sigma^j$
as follows: for every $y_a \in \eta_1$, she sets $(\rho\sigma)_{a,i}$ back to $*$ for all $i \in U_a$, where

$$U_a = \{i \in [w] : (\rho\sigma)_{a,i} = 0 \text{ or } x_{a,i} \in \gamma_1\}.$$

To see that this indeed “undoes” $\sigma^1$, first recall that for every $y_a \in \eta_1$, the restriction $\sigma^1$ is defined
so that $\sigma^1_{a,i} \in \{0,1\}$ iff $\rho_{a,i} = *$, and furthermore, $\sigma^1_{a,i} = 1$ iff $x_{a,i} \in \gamma_1$. (Recall the example in
Figure 1.) Therefore, to obtain $\rho\sigma^2 \cdots \sigma^j$ from $\rho\sigma^1\sigma^2 \cdots \sigma^j$, for every $y_a \in \eta_1$ and $i \in [w]$ the
decoder sets $(\rho\sigma)_{a,i}$ back to $*$ if either $(\rho\sigma)_{a,i} = 0$ or $x_{a,i} \in \gamma_1$. Finally, using $\pi^1 \in \{0,1\}^{\eta_1}$ she
constructs the hybrid restriction $\rho (\eta_1 \mapsto \pi^1) \sigma^2 \cdots \sigma^j$.

By the same reasoning, for every $2 \le \ell \le j$ the decoder is able to iteratively recover $T_\ell$, $\eta_\ell$, $\gamma_\ell$,
and $\pi^\ell$ from the hybrid restriction

$$\rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1}) \sigma^\ell \cdots \sigma^j.$$

With this information she “undoes” $\sigma^\ell$ within $\rho (\eta_1 \mapsto \pi^1) \cdots (\eta_{\ell-1} \mapsto \pi^{\ell-1}) \sigma^\ell \cdots \sigma^j$, and constructs
the next hybrid restriction

$$\rho (\eta_1 \mapsto \pi^1) \cdots (\eta_\ell \mapsto \pi^\ell) \sigma^{\ell+1} \cdots \sigma^j.$$

Finally, having recovered $\rho (\eta_1 \mapsto \pi^1) \cdots (\eta_j \mapsto \pi^j)$ and $\eta = \eta_1 \cup \cdots \cup \eta_j$, the decoder will have all
the information she needs to recover the actual restriction $\rho$: she sets $(\rho (\eta_1 \mapsto \pi^1) \cdots (\eta_j \mapsto \pi^j))_{a,i}$
back to $*$ for every $y_a \in \eta$ and $i \in U_a$. □
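The “undo” step for a single block can be made concrete with the example of Figure 1. The sketch below assumes a block $\rho_a \in \{*,1\}^w \setminus \{1\}^w$ with $w = 8$, in which $x_{a,1}$ occurs negatively and $x_{a,8}$ positively in the current term; the particular block contents and the helper names `apply_sigma` and `undo_sigma` are illustrative, not taken from the paper.

```python
def apply_sigma(rho_block, term_literals):
    """Return one block of rho*sigma together with the positions set to 1.

    rho_block: list over coordinates 1..w with entries '*' or 1.
    term_literals: dict mapping a coordinate i to True (x_{a,i} occurs
                   positively) or False (negatively) in the current term.
    """
    fixed, gamma = [], set()
    for i, value in enumerate(rho_block, start=1):
        if value != '*':
            fixed.append(value)          # already fixed by rho
        elif term_literals.get(i) is True:
            fixed.append(1)              # satisfy the positive literal
            gamma.add(i)                 # recorded as auxiliary information
        else:
            fixed.append(0)              # negative literal, or star not in the term
    return fixed, gamma

def undo_sigma(rho_sigma_block, gamma):
    """Recover the block of rho: zeros and gamma positions go back to '*'."""
    return ['*' if (value == 0 or i in gamma) else value
            for i, value in enumerate(rho_sigma_block, start=1)]

rho_block = ['*', '*', 1, '*', '*', 1, '*', '*']
rho_sigma_block, gamma = apply_sigma(rho_block, {1: False, 8: True})
recovered = undo_sigma(rho_sigma_block, gamma)
```

Because the block of $\rho$ contains no zeros, every 0 in the fixed block must have been a star, and only the positions in `gamma` need to be communicated to the decoder.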

9.5 Proof of Proposition 9.2 

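For all possible outcomes $\vartheta_2, \vartheta_3, \vartheta_4$ of the second, third, and fourth coordinates of the map $\theta$ defined
in Proposition 9.4, we define

$$\mathcal{B}_{\vartheta_2,\vartheta_3} = \{\rho \in \mathcal{B} : \theta_2(\rho) = \vartheta_2,\ \theta_3(\rho) = \vartheta_3\} \subseteq \mathcal{B},$$

$$\mathcal{B}_{\vartheta_2,\vartheta_3,\vartheta_4} = \{\rho \in \mathcal{B} : \theta_2(\rho) = \vartheta_2,\ \theta_3(\rho) = \vartheta_3,\ \theta_4(\rho) = \vartheta_4\} \subseteq \mathcal{B}_{\vartheta_2,\vartheta_3}.$$

We begin by bounding the probability that $\boldsymbol{\rho} \leftarrow \mathcal{R}(\tau)$ belongs to $\mathcal{B}_{\vartheta_2,\vartheta_3,\vartheta_4}$ for a fixed tuple
$(\vartheta_2, \vartheta_3, \vartheta_4)$. The following fact, giving the probability mass function of $\mathcal{R}(\tau)$, will be useful for
us (its proof is by inspection of Definition 9):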

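Fact 9.6. Fix $\tau \in \{0,1,*\}^{A_k}$, and write $S_a = S_a(\tau)$ to denote $\tau_a^{-1}(*) = \{i \in [w_{k-1}] : \tau_{a,i} = *\}$.
Then $\Pr_{\boldsymbol{\rho} \leftarrow \mathcal{R}(\tau)}[\boldsymbol{\rho} = \rho] = \xi(\rho)$ for all $\rho \in \{0,1,*\}^{A_k}$, where $\xi : \{0,1,*\}^{A_k} \to [0,1]$ is the probability
mass function:

$$\xi(\rho) = \prod_{a \,:\, S_a \ne \emptyset} \xi_a(\rho(S_a)),$$

and $\rho(S_a)$ denotes the substring of $\rho_a$ with coordinates in $S_a$, and $\xi_a : \{0,1,*\}^{S_a} \to [0,1]$ is the
probability mass function:

$$\xi_a(\varrho) = \begin{cases} \lambda & \text{if } \varrho = \{1\}^{S_a} \\[4pt] q_a \cdot \dfrac{t_k^{|\varrho^{-1}(*)|} (1-t_k)^{|\varrho^{-1}(1)|}}{1 - (1-t_k)^{|S_a|}} & \text{if } \varrho \in \{*,1\}^{S_a} \setminus \{1\}^{S_a} \\[8pt] (1 - \lambda - q_a) \cdot \dfrac{t_k^{|\varrho^{-1}(0)|} (1-t_k)^{|\varrho^{-1}(1)|}}{1 - (1-t_k)^{|S_a|}} & \text{if } \varrho \in \{0,1\}^{S_a} \setminus \{1\}^{S_a}. \end{cases}$$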


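Lemma 9.7. For all $\vartheta_2, \vartheta_3, \vartheta_4$,

$$\Pr_{\boldsymbol{\rho} \leftarrow \mathcal{R}(\tau)}[\boldsymbol{\rho} \in \mathcal{B}_{\vartheta_2,\vartheta_3,\vartheta_4}] = \big(O(w^{-1/4})\big)^s \left(\frac{t_k}{1-t_k}\right)^{\|\vartheta_4\|},$$

where $\|\vartheta_4\|$ denotes $|\vartheta_4^{-1}(1)|$, the Hamming weight of $\vartheta_4$.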

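Proof. Fix $\rho \in \mathcal{B}_{\vartheta_2,\vartheta_3,\vartheta_4}$. The restrictions $\rho$ and $\theta_1(\rho) = \rho\sigma$ differ in exactly $s$ blocks: these are
the blocks $a \in A_{k-1}$ such that $y_a \in \eta$. Consider any such $a \in A_{k-1}$ and recall (as observed in the
definition of $\sigma$) that $S_a$ is $k$-acceptable and $\rho(S_a) \in \{*,1\}^{S_a} \setminus \{1\}^{S_a}$ whereas $(\rho\sigma)(S_a) \in \{0,1\}^{S_a}$.
Let $\Delta_a$ denote $|(\rho\sigma)_a^{-1}(1)| - |\rho_a^{-1}(1)|$, the number of “new 1’s” that $\sigma$ introduces into block $a$ (note
that as observed earlier we have that $\Delta_a \ge 0$). By Fact 9.6, we have that

$$\frac{\xi_a((\rho\sigma)(S_a))}{\xi_a(\rho(S_a))} = \begin{cases} \dfrac{\lambda}{q_a} \cdot \dfrac{1 - (1-t_k)^{|S_a|}}{t_k^{\Delta_a} (1-t_k)^{|S_a| - \Delta_a}} & \text{if } (\rho\sigma)(S_a) = \{1\}^{S_a} \\[10pt] \dfrac{1 - \lambda - q_a}{q_a} \left(\dfrac{1-t_k}{t_k}\right)^{\Delta_a} & \text{if } (\rho\sigma)(S_a) \in \{0,1\}^{S_a} \setminus \{1^{S_a}\}. \end{cases} \qquad (22)$$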

Since Sa is fe-acceptable, we have that IS'al = qw ± and therefore 

^ - (1 - 

qtk-i + A 


< 


(1 - 

qtk-i + 


< 2g^ 


1 — 

where the equality is by (8) and the final inequality uses Lemma 7.1, (7) and (10). Since qa < 2g 
by Lemma 10.5, we may lower bound the quantity in the first line of (22) by 

/ 1 j. \ 

;l/4) / 




1-4 


4 


= ^(m^ 


V 4 


where we have used our choice of $\lambda$ in (7) and the estimates (10). Similarly, for the quantity in the second line of (22) we have the lower bound


1 ^ 7a /1 4 

Qa \ 4 


= n 


w 


log w 


l-tk 

4 


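and so in both cases we may lower bound the ratio in (22) by

$$\frac{\zeta_a((\rho\sigma)(S_a))}{\zeta_a(\rho(S_a))} \ge \Omega(w^{1/4}) \left(\frac{1-t_k}{t_k}\right)^{\Delta_a}.$$

Since $\sum_a \Delta_a = \|\vartheta_4\|$, it follows from Fact 9.6 that

$$\frac{\xi(\theta_1(\rho))}{\xi(\rho)} = \frac{\xi(\rho\sigma)}{\xi(\rho)} = \prod_{a} \frac{\zeta_a((\rho\sigma)(S_a))}{\zeta_a(\rho(S_a))} \ge \left(\Omega(w^{1/4})\right)^s \left(\frac{1-t_k}{t_k}\right)^{\|\vartheta_4\|}. \qquad (23)$$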


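Finally, summing over all $\rho \in \mathcal{B}_{\vartheta_2,\vartheta_3,\vartheta_4}$ we conclude that

$$\Pr\left[\boldsymbol{\rho} \in \mathcal{B}_{\vartheta_2,\vartheta_3,\vartheta_4}\right] = \sum_{\rho \in \mathcal{B}_{\vartheta_2,\vartheta_3,\vartheta_4}} \xi(\rho) \le \left(O(w^{-1/4})\right)^s \left(\frac{t_k}{1-t_k}\right)^{\|\vartheta_4\|} \sum_{\rho \in \mathcal{B}_{\vartheta_2,\vartheta_3,\vartheta_4}} \xi(\theta_1(\rho)) \le \left(O(w^{-1/4})\right)^s \left(\frac{t_k}{1-t_k}\right)^{\|\vartheta_4\|}.$$

Here the first inequality is by (23), and the second uses the fact that $\theta$ is an injection (Proposition 9.4), and hence any two distinct $\rho, \rho' \in \mathcal{B}_{\vartheta_2,\vartheta_3,\vartheta_4}$ map to distinct $\theta_1(\rho), \theta_1(\rho')$, so $\sum_{\rho} \xi(\theta_1(\rho))$ is at most 1 since $\xi$ is a probability mass function. □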


Proposition 9.2 follows as a straightforward consequence of Lemma 9.7: 

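Proof of Proposition 9.2. Summing over all $\vartheta_4 \in \{0,1\}^{rs}$ and stratifying according to Hamming weight, we have that

$$\Pr\left[\boldsymbol{\rho} \in \mathcal{B}_{\vartheta_2,\vartheta_3}\right] = \sum_{i=0}^{rs} \sum_{\|\vartheta_4\| = i} \Pr\left[\boldsymbol{\rho} \in \mathcal{B}_{\vartheta_2,\vartheta_3,\vartheta_4}\right] \le \left(O(w^{-1/4})\right)^s \sum_{i=0}^{rs} \binom{rs}{i} \left(\frac{t_k}{1-t_k}\right)^i = \left(O(w^{-1/4})\right)^s \left(\frac{1}{1-t_k}\right)^{rs}.$$

Taking a union bound over all $2^s$ possible $\vartheta_2 \in \{0,1\}^s$ and $(2r)^s$ possible $\vartheta_3$ completes the proof. □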


Proof of Proposition 9.1. For Proposition 9.1, we first observe that Proposition 9.4 also holds for $\boldsymbol{\rho} \leftarrow \mathcal{R}_{\mathrm{init}}$ (the proof is completely identical, with $\tau$ being the trivial restriction $\{*\}^n$). Proposition 9.1 then follows as a consequence of Proposition 9.4 in a very similar manner (the calculations are in fact significantly simpler); we point out the essential differences in this section. We begin with the following analogue of Fact 9.6, specifying the probability mass function of $\mathcal{R}_{\mathrm{init}}$ (like Fact 9.6, its proof is by inspection of Definition 6):

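Fact 9.8. $\Pr_{\boldsymbol{\rho} \leftarrow \mathcal{R}_{\mathrm{init}}}[\boldsymbol{\rho} = \rho] = \xi(\rho)$ for all $\rho \in \{0,1,*\}^n$ (recall that $w_{d-1} = m$), where $\xi : \{0,1,*\}^n \to [0,1]$ is the probability mass function:

$$\xi(\rho) = \prod_{a \in A_{d-1}} \zeta(\rho_a),$$

and $\zeta : \{0,1,*\}^m \to [0,1]$ is the probability mass function:

$$\zeta(\varrho) = \begin{cases} \lambda & \text{if } \varrho = \{1\}^m \\[4pt] q \cdot \dfrac{p}{1-p} & \text{if } \varrho \in \{*,1\}^m \setminus \{1\}^m \\[8pt] (1-\lambda-q) \cdot \dfrac{p}{1-p} & \text{if } \varrho \in \{0,1\}^m \setminus \{1\}^m. \end{cases}$$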


Fact 9.8 gives us the following analogue of (22):

$$\frac{\zeta((\rho\sigma)_a)}{\zeta(\rho_a)} = \begin{cases} \dfrac{\lambda(1-p)}{qp} & \text{if } (\rho\sigma)_a = \{1\}^m \\[10pt] \dfrac{1-\lambda-q}{q} & \text{if } (\rho\sigma)_a \in \{0,1\}^m \setminus \{1\}^m, \end{cases}$$

and so by our choice of $\lambda$ in (7) and our estimates (10) this ratio is always at least $\Omega(w^{1/4})$. (Unlike the proof of Lemma 9.7, our lower bound here does not depend on $\Delta_a = |(\rho\sigma)_a^{-1}(1)| - |\rho_a^{-1}(1)|$.) By the same calculations as in the proof of Lemma 9.7, we have the following analogue of Lemma 9.7:
Lemma 9.9. For all $\vartheta_2, \vartheta_3, \vartheta_4$ we have that $\Pr_{\boldsymbol{\rho} \leftarrow \mathcal{R}_{\mathrm{init}}}\left[\boldsymbol{\rho} \in \mathcal{B}_{\vartheta_2,\vartheta_3,\vartheta_4}\right] = \left(O(w^{-1/4})\right)^s$.

Proposition 9.1 follows by a union bound over all $2^s$ possible $\vartheta_2 \in \{0,1\}^s$, $(2r)^s$ possible $\vartheta_3$, and $2^{rs}$ possible $\vartheta_4 \in \{0,1\}^{rs}$ (unlike in the proof of Proposition 9.2 we do not have to stratify the union bound over $\vartheta_4 \in \{0,1\}^{rs}$ according to Hamming weight).

9.6 Approximator simplifies under random projections 

The main results of this section are Theorems 13 and 14. The first of these theorems says that any depth-$d$ circuit whose size is not too large and whose bottom fan-in is significantly smaller than that of $\mathrm{Sipser}_d$ will collapse to a shallow decision tree with high probability under the random projection $\mathbf{\Psi}$ from Definition 10:

Theorem 13. For 2 < d < , let C : {0,1}*^ —)> {0,1} be a depth-d circuit with bottom fan-in 

1 ^ 

at most and size S < 2"'® ‘‘ ^ . Then ^(C) is computed by a decision tree of depth 

with probability 1 — exp ( — )). 

The second theorem is quite similar; it says that under the random projection $\mathbf{\Psi}$, any depth-$d$ circuit $C$ that is not too large, regardless of its bottom fan-in, will collapse to a depth-2 circuit with bounded bottom fan-in and with top gate matching that of $C$:

Theorem 14. For 2 < d < let C : {0,1}" —^ {0,1} be a depth-d circuit of size S < 

1 1 

22 " and unbounded bottom fan-in. 

1. If the top gate of C is an AND, then '^{C) is {1/S)-close (with respect to the uniform distri¬ 
bution on {0,1}"} to a width-n‘^(‘^-^') CNF with probability 1 — exp ( — )). 

1 

2. If the top gate ofC is an OR, then ^'(C') is {l/S)-close to a width-n‘^(‘^-^'> DNF with probability 

1 — exp ( — )). 

We first prove Theorem 13, which deals with depth-d circuits with bounded bottom fan-in. We 
state the following simple lemma explicitly for convenience of later reference: 

Lemma 9.10. Suppose that 3 < d < For 2 < k < d — 1 and i let C : {0, l}"^'=+i —^ 

{0,1} be a size-S depth-i circuit with bottom fan-in . For any r G {•, o, *}"^fe+i, with probability 
at least 1 — S ■ over p ^ H{j), we have that proj^C is a depth-{£ — 1) circuit with bottom 

fan-in , and has the same number of gates at distance at least two from the input variables 
as C. 

Proof. The lemma follows from applying Proposition 9.2 with r = s = and a union bound 
over all gates of C (at most S many) that are at distance 2 from the input variables. □ 

The following proposition directly implies Theorem 13 by straightforward translation of param¬ 
eters, recalling (5): 
Proposition 9.11. For 2 < d < ; let $C : \{0,1\}^n \to \{0,1\}$ be a depth-d circuit with

bottom fan-in and size S < 2^^^^. Then '^{C) is computed by a depth-{w^/^) decision tree with 
probability 1 — 

Proof. Applying Proposition 9.1 with r = and s = to each of the bottom-layer gates of 
C, we have that projp(d) C is a depth-((i — 1) circuit with bottom fan-in with probability at 

least 1 — S ■ > 1 — over TZmit- If d = 2, we observe that in fact Proposition 9.1 

gives us that projp(d) C is a decision tree of the desired depth, and we are done. If d > 3, the claim 
follows by a union bound over d — 2 applications of Lemma 9.10 (where we observe from the proof 
of Lemma 9.10 that in the last application of Lemma 9.10 we may conclude that ^(C) is in fact a 
decision tree of depth □ 

Next we turn to Theorem 14. We require the following standard lemma showing that any circuit 
can be “trimmed” to reduce its bottom fan-in while changing its value on only a few inputs: 

Lemma 9.12. Let $C : \{0,1\}^n \to \{0,1\}$ be a circuit of size $S$ and let $\varepsilon > 0$. There exists a circuit $C' : \{0,1\}^n \to \{0,1\}$ such that

1. The size and depth of $C'$ are both at most that of $C$;

2. The bottom fan-in of $C'$ is at most $\log(S/\varepsilon)$;

3. $C$ and $C'$ are $\varepsilon$-close with respect to the uniform distribution.

Proof. $C'$ is obtained from $C$ by replacing each bottom-level AND (OR, respectively) gate whose fan-in exceeds $\log(S/\varepsilon)$ with 0 (1, respectively). Each such gate originally takes its minority value on at most an $\varepsilon/S$ fraction of all inputs, so the lemma follows from a union bound over the at most $S$ such gates. □
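The trimming step of Lemma 9.12 can be illustrated on a small depth-2 example. The sketch below assumes a DNF representation (an OR of AND terms), counts the size $S$ as the number of gates, and checks by brute force that dropping every AND of fan-in larger than $\log_2(S/\varepsilon)$ changes the function on at most an $\varepsilon$ fraction of inputs. The data structure and the example circuit are illustrative assumptions, not objects from the paper.

```python
from itertools import product
from math import log2

def eval_dnf(terms, x):
    """OR of ANDs; each term is a list of (index, wanted_bit) literals."""
    return int(any(all(x[i] == b for i, b in term) for term in terms))

def trim(terms, eps):
    """Replace every AND gate of fan-in > log2(S/eps) by the constant 0, i.e. drop it."""
    size = len(terms) + 1                       # bottom AND gates plus the top OR gate
    cap = log2(size / eps)
    return [term for term in terms if len(term) <= cap]

def distance(f_terms, g_terms, n):
    """Fraction of uniform inputs on which the two DNFs disagree."""
    diff = sum(eval_dnf(f_terms, x) != eval_dnf(g_terms, x)
               for x in product((0, 1), repeat=n))
    return diff / 2 ** n

# Illustrative 8-variable DNF with one narrow term and two wide terms.
n, eps = 8, 0.25
terms = [
    [(0, 1), (1, 1)],
    [(2, 1), (3, 0), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(0, 0), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
]
trimmed = trim(terms, eps)
print(len(trimmed), distance(terms, trimmed, n))
```

A wide AND over $f$ distinct variables is 1 on only a $2^{-f}$ fraction of inputs, which is why replacing it by 0 is cheap; the union bound over all trimmed gates gives the $\varepsilon$ guarantee.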

The following proposition directly implies Theorem 14 (by straightforward translation of pa¬ 
rameters) : 

Proposition 9.13. For 2 < d < , let C : {0,1}"^^ —> {0)1} be a depth-d circuit of size 

S' < 22 ”' and unbounded bottom fan-in. 

1. If the top gate of C is an AND, then ’J'(C) is {1/S)-close to a width-{w^^^) CNF with proba¬ 
bility 1 — . 

2. If the top gate ofC is an OR, then ^{C) is (l/S)-cZose to a width-{w^^^) DNF with probability 

1 _ e-0(u,i/5)_ 

Proof. By symmetry it suffices to prove the first claim. Applying Lemma 9.12 with $\varepsilon = 1/S$, we have that $C$ is $(1/S)$-close to a circuit $C' : \{0,1\}^n \to \{0,1\}$ of size and depth at most that of $C$,
and with bottom fan-in log(S'/e) = 21og(5) < Certainly the size, depth, and bottom fan-in of 

$\mathrm{proj}_{\rho^{(d)}}\, C'$ is at most that of $C'$ with probability 1 over the randomness of $\boldsymbol{\rho}^{(d)} \leftarrow \mathcal{R}_{\mathrm{init}}$ (note that unlike in the proof of Proposition 9.11, we do not argue that the depth of $C'$ decreases by one under an $\mathcal{R}_{\mathrm{init}}$-random projection; the bottom fan-in of $C'$ is too large for us to apply Proposition 9.1). If $d = 2$ then this already gives the result (in fact with no failure probability). If $d \ge 3$, the proposition then follows by a union bound over $d - 2$ applications of Lemma 9.10. □
10 Sipser retains structure under random projections

Now we turn our attention to the randomly projected target $\mathbf{\Psi}(\mathrm{Sipser}_d)$. As discussed in Section 7.3, we would like to establish Property 2 showing that $\mathrm{Sipser}_d$ “retains structure” under a $\mathbf{\Psi}$-random projection: with high probability over $\mathbf{\Psi}$, the randomly projected target $\mathbf{\Psi}(\mathrm{Sipser}_d)$ is a depth-one formula whose bias remains very close to 1/2 (with respect to an appropriate product distribution over $\{0,1\}^{w_0}$). This is necessarily a high-probability statement; to establish it, we must account for the failure probabilities introduced by each of the $d-1$ individual random projections $\mathrm{proj}_{\rho^{(k)}}$ that comprise $\mathbf{\Psi}$. To reason about these failure probabilities and carefully account for them, in Section 10.1 we introduce the notion of a “typical” restriction and prove some useful properties about how typicality interacts with our random projections. In Section 10.2 we use these properties to establish the main results of this section, that $\mathrm{Sipser}_d$ “retains structure” when it is hit with the random projection $\mathbf{\Psi}$.

10.1 Typical restrictions 

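Recalling the $\bullet, \circ$ notation from Table 2, we begin with the following definition:

Definition 14. Let $\tau \in \{\bullet, \circ, *\}^{A_k}$ where $2 \le k \le d-1$. We say that $\tau$ is typical if it satisfies:

1. For every $a \in A_{k-1}$ the set $\tau_a^{-1}(*) \subseteq [w_{k-1}]$ is $k$-acceptable, where we recall from Definition 8 that this means

$$|\tau_a^{-1}(*)| = qw \pm w^{\beta(k,d)}, \quad \text{where } \beta(k,d) := \frac{1}{3} + \frac{d-k-1}{12d}.$$

(Note that $\frac{1}{3} \le \beta(k,d) < \frac{5}{12}$ for all $d \in \mathbb{N}$ and $2 \le k \le d-1$.) We observe that by Definition 7, this condition implies that for every $a \in A_{k-2}$, we have

$$\hat{\tau}_a \in \{*, \circ\}^{w_{k-2}}. \qquad (24)$$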

2. For every a G Ak- 2 , 

|(fa)"H*)| > Wk-2 - 

We note that (24) and Condition (2) together imply that 

Ta = * for all a G Ak- 2 - 

See Figure 2 on the next page for an illustration of a typical $\tau$. The rationale behind Definition 14 is that projections $\mathrm{proj}_\rho$ such that $\hat{\rho}$ is typical have a very limited (and well-controlled) effect on the target $\mathrm{Sipser}_d$: roughly speaking, these projections “wipe out” the bottom-level gates of the formula (reducing its depth by one), “trim” the fan-ins of the next-to-bottom-level gates from $w$ to approximately $qw = \tilde{\Theta}(\sqrt{w})$, but otherwise essentially preserve the rest of the structure of the formula. We give a precise description in Section 10.2; see Remark 16.

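As a concrete example of a failure event, consider an outcome $\rho^{(d)} \in \mathrm{supp}(\mathcal{R}_{\mathrm{init}}) = \{0,1,*\}^n$ which is such that $(\rho^{(d)}_b)^{-1}(0)$ is nonempty for all $b \in A_{d-1}$. In this case

$$\mathrm{proj}_{\rho^{(d)}}\, \mathrm{Sipser}_d \equiv \mathrm{proj}\left(\mathrm{Sipser}_d \upharpoonright \rho^{(d)}\right) \equiv 0$$

(recall that the bottom-level gates of $\mathrm{Sipser}_d$ are AND gates), and our target function is set to the constant 0 already after the first $\mathcal{R}_{\mathrm{init}}$-random projection.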
Figure 2: The figure illustrates a typical $\tau \in \{\bullet, \circ, *\}^{A_k}$. For $a \in A_{k-1}$, $\tau_a$ is a block of length

Wk-i, i.e. a string in {•, o, . We may think of the block Ta as being located at level k. By 

Condition (1) of Definition 14, for every a G Ak-i we have that |'rT^(*)|) the number of *’s in 
Tq, is roughly qw = Q{^/w). The lift r of r is a string in and for a G Ak_ 2 , Ta 

is a block of length Wk- 2 - We may think of the block Tq as being located at level k — 1. As 
stipulated by (24), for every a G Afc_ 2 , the string belongs to {*, o}’^'=-2. By Condition (2) of 
Definition 14, for every a G Ak- 2 , we have that |(Ta)~^(*)|, the number of *’s in Ta, is at least 
Wk -2 — = rcfc_ 2 (l — o(l)). Finally, we observe that (24) and Condition (2) of Definition 14 

imply that r« = * for every a G Ak- 2 - 
To prove that $\mathbf{\Psi}(\mathrm{Sipser}_d)$ is a well-structured formula with high probability over the random
choice of Tf EE we will in fact establish the stronger statement showing that with high 

probability, every single one of the individual random projections projp(fe) only has a limited and 
well-controlled effect (in the sense described above) on the structure of Sipser^. By Definition 14, 

this amounts to showing that the lifts , • • •, p^^^ associated with the d—\ individual projections 
comprising are all typical with high probability. We prove this inductively: we first show that 
for p^'^) Tiinit its lift pf'^) is typical with high probability (Proposition 10.1), and then argue 
that if is typical then the lift of pi*^^ ^ 7?.(p(^+^)) is also typical with high probability 

(Proposition 10.2). The parameters of Definition 14 are chosen carefully so that it “bootstraps” 
in the sense of Proposition 10.2; in particular, this is the reason why we allow more and more 
deviation from $qw$ in Condition 1 as $k$ gets smaller (closer to the root).

Our two main results in this subsection are the following: 

Proposition 10.1 (Establishing initial typicality). Suppose that 3 < d < for a sufficiently 

small absolute constant c > 0. Then 

Pr [p is typical\ > 1 — 

Proposition 10.2 (Preserving typicality). Suppose that 3 < d < for a sufficiently small 

absolute constant c > 0. Let 2 < k < d — 1 and let r G {•, o, be typical. Then 

Pr [p is typical} > 1 — 


10.1.1 Establishing initial typicality: Proof of Proposition 10.1 

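For notational brevity, throughout this subsubsection we write $\tau$ to denote $\hat{\boldsymbol{\rho}} \in \{0,1,*\}^{A_{d-1}}$ where $\boldsymbol{\rho} \leftarrow \mathcal{R}_{\mathrm{init}}$. We proceed to establish the two conditions of Definition 14.

Lemma 10.3 (Condition (1) of typicality). Fix $a \in A_{d-2}$. Then

$$\Pr\left[|\tau_a^{-1}(*)| = qw \pm w^{\beta(d-1,d)}\right] \ge 1 - \exp\left(-\tilde{\Omega}(w^{1/6})\right).$$

Proof. Recalling (11), we have that

$$\Pr[\tau_{a,i} = *] = q \quad \text{independently for all } i \in [w].$$

We shall apply Fact 5.1 with $S = Z_1 + \cdots + Z_w$ where $Z_i \leftarrow \{0_{1-q}, 1_q\}$ (so $\mu = \mathbf{E}[S]$ is $qw$), and $\gamma$ such that $\gamma\mu = w^{\beta(d-1,d)} = w^{1/3}$. Observe that since $\mu = qw = \Theta((w \log w)^{1/2})$, we have $\gamma = \Theta(w^{-1/6} (\log w)^{-1/2})$. Hence by Fact 5.1 we have that

$$\Pr\left[\left||\tau_a^{-1}(*)| - qw\right| > w^{1/3}\right] \le \exp(-\Omega(\gamma^2 \mu)) = \exp\left(-\tilde{\Omega}(w^{1/6})\right). \qquad \square$$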
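The concentration step in the proof of Lemma 10.3 is a Chernoff-type bound for a sum of $w$ independent indicator variables with mean $\mu = qw$. The following sketch compares the exact two-sided binomial tail against the textbook multiplicative bound $2\exp(-\gamma^2\mu/3)$; that this is the form of Fact 5.1 is an assumption, and the values of `w`, `a`, `b` (with $q = a/b$) are illustrative and far smaller than the asymptotic regime of the paper.

```python
from fractions import Fraction
from math import comb, exp

def two_sided_tail(w, a, b, dev):
    """Exact Pr[|Bin(w, a/b) - w*a/b| > dev], computed with integer arithmetic."""
    mu = Fraction(w * a, b)
    num = sum(comb(w, j) * a ** j * (b - a) ** (w - j)
              for j in range(w + 1) if abs(j - mu) > dev)
    return float(Fraction(num, b ** w))

def chernoff(mu, gamma):
    """Textbook two-sided multiplicative Chernoff bound, valid for 0 < gamma <= 1."""
    return 2 * exp(-gamma ** 2 * mu / 3)

w, a, b = 1024, 1, 8            # illustrative: q = 1/8, so mu = q*w = 128
mu = w * a / b
for gamma in (0.25, 0.5, 1.0):
    exact = two_sided_tail(w, a, b, gamma * mu)
    print(gamma, exact, chernoff(mu, gamma))
```

Integer arithmetic is used for the exact tail because the individual binomial terms overflow or underflow double precision for values of `w` in this range.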
The following observations may help the reader follow the next proof: Recalling Table 2, since our $\tau$ belongs to $\{0,1,*\}^{A_{d-1}}$, we see that $\tau$ corresponds to the second row of the table: the gates at depth $d-2$ are OR gates, a $\circ$-value for a coordinate of $\tau$ corresponds to 0, and a $\bullet$-value corresponds to 1. However, since $\hat{\tau}$, the lift of $\tau$, is one level higher than $\tau$ in the $\mathrm{Sipser}_d$ formula (see Figure 2), $\hat{\tau}$ corresponds to the first row of the table; so when Definition 7 specifies a coordinate $\hat{\tau}_{a,i}$ of $\hat{\tau}$, a $\circ$-value for $\hat{\tau}_{a,i}$ corresponds to 1 and a $\bullet$-value corresponds to 0.

Lemma 10.4 (Condition (2) of typicality). Fix a G Then 

Pr [|(t«)“^(*)| < Wd -3 - 

Proof. Recall from Definition 7 that Ta,i = 0 iff Ta,i = {0}"''^“^ (in order for an OR to be 0, all its 
inputs must be 0). In turn, each coordinate of Ta^i (we emphasize that Ta,i is a string of length w) 
is an AND of the w coordinates of some from (11), and hence is 0 with probability 1 — \ — q. 
By independence we have that 

Pr[Ta,i = 0]=6:={l-\-q)^ <{1- q)^ < e"'?"' (25) 

holds independently for all i G 

We next give an expression for Pr = l]. From Definition 7 we have that Ta,i = 1 iff any of 
the w coordinates of Ta,i is 1 (in order for an OR to be 1, we only need one input to be 1). As noted 
above, each coordinate of Ta,i is an AND of the w coordinates of some from (11); this AND is 
1 iff its input string is {1}'^, so by (11) each coordinate of Ta,i is not 1 with probability 1 — A. 
Hence all w coordinates of Ta,i are not 1 with probability (1 — A)"', and Ta,i = 1 with probability 
l-(l-A)"'. 

We thus have that, independently for all i G [rc^-s], 

Pr G {0,1}] = (5 + (1 - (1 - A)"') < ,5 + (1 - (1 - Au;)) < 2Au; = 

where the last inequality holds (with room to spare) by (25). Applying Fact 5.1, we have that 

Pr [|r-^({0,1})| > 

with room to spare. □ 

Proof of Proposition 10.1. The proposition follows immediately from Lemmas 10.3 and 10.4 and a 
union bound over all a G A(i -2 and a G Arf_ 3 , using the fact that |Arf_ 3 | < |Arf_ 2 | <n< and 

the bound d < . □ 

— log log w 

10.1.2 Preserving typicality: Proof of Proposition 10.2 

The following numerical lemma relates $q_a$ as defined in (13) of Definition 9 to $q$ as defined in (7):

Lemma 10.5. Let $2 \le k \le d-1$ and $S \subseteq [w_{k-1}]$ be $k$-acceptable (i.e. $|S| = qw \pm w^{\beta(k,d)}$), and define

$$q' := \frac{(1-t_k)^{|S|} - \lambda}{t_{k-1}}.$$

Then $q' = q \cdot (1 \pm 2 t_k w^{\beta(k,d)})$. (And in particular, by our bounds on $t_k$ in Lemma 7.1 and the definition of $\beta(k,d)$, we have that $q' = q \pm o(q)$ for all $k$.)
Proof. For the lower bound, we have the following:


, ^ (1 - _ A 

^ ~ tk-i 


^ (1 - - A(i - 

^ ( 1 - 4 )^"-A 

“ tfc_i(l - 4)"'^'''’"' 4-i(l - 

_ 4-ig _ _)^4^,/3(fc,d) 

“ 4-i(l - 4)"''"'''’"' 4-i(l - 4)^'"^"’'*^ 


< 


+ ■ 


l + 3g' 


0.1 






<q-{l + 2tkW^^^'^'^), 


(by (8)) 


(by Lemma 7.1) 


where for the last inequality we have used the fact that qt^ = @{w whereas X = @{w For 
the upper bound, we have 

, (1 - _ A 

^ ~ 4-1 

^ (1-4)9^(1-4u;/3(M))-a 

~ 4-1 

>q.{l-t,wf^(k,d)^--^ (by ( 8 )) 

tfc-i 

>g-(1-2411^^^^’'^^). 

where the last inequality uses the definition of $\lambda$ in (7) and our bound on $t_{k-1}$ in Lemma 7.1. □
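A small numerical experiment can make the statement of Lemma 10.5 concrete. The sketch below assumes that recursion (8) has the form $(1-t_k)^{qw} = \lambda + q\,t_{k-1}$, which makes $q' = q$ exactly when $|S| = qw$, and then perturbs $|S|$ by $\pm w^{1/3}$. The values of `m`, `lam`, the starting point `t_k = q`, and the relation between `w` and `q` are illustrative assumptions chosen only so that the quantities are in a sensible range.

```python
def t_prev(t_k, q, w, lam):
    """Recursion (8), assumed in the form (1 - t_k)^(q*w) = lam + q * t_{k-1}."""
    return ((1 - t_k) ** (q * w) - lam) / q

def q_prime(size, t_k, t_km1, lam):
    """The quantity q' of Lemma 10.5 for a set S of the given size."""
    return ((1 - t_k) ** size - lam) / t_km1

m = 20
w = int(m * 2 ** m * 0.6931471805599453)   # assumption: chosen so that q^2 * w is about ln(1/q^2)
q = 2.0 ** (-m / 2)
lam = 1e-7                                  # illustrative value only
t_k = q                                     # illustrative starting point
t_km1 = t_prev(t_k, q, w, lam)
delta = w ** (1 / 3)                        # deviation w^beta with beta = 1/3

for size in (q * w - delta, q * w, q * w + delta):
    print(size, q_prime(size, t_k, t_km1, lam) / q)
```

The printed ratios $q'/q$ stay within $1 \pm 2 t_k w^{1/3}$ for these values, in line with the form of the lemma; the lemma itself is an asymptotic statement and is not established by this check.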

Similar to the proof of Proposition 10.1, Proposition 10.2 follows from Lemmas 10.6 and 10.8 
(stated and proved below) and a union bound, again using the fact that each \Ai\ < n and the 
bound d < 1 ^ 7 ^^ ■ Since Proposition 10.2 deals with general values of k which may correspond to 
either row of Table 2, to avoid redundancy we use o, • notation in the statements and proofs of the 
following lemmas. 

Lemma 10.6 (Condition (1) of typicality). For 2 < k < d — 2 let t (z {•, o, he typical and 

fix a & Ak-i- Then 

Pr [|(pJ-H*)| = > 1 - exp(-fi(ii;2^(^’‘^)-i)) 

P^np) 

(Recall that from Definition I 4 that fi{k, d) = ^ + '^~i 2 d^ )■ 

Proof. Since r G {•, o, >i:}"^fc+i is typical, we have that 

faG{*,o}"' and \{Ta)~^{*)\ > w - (26) 
by the second and third property of $\tau$ being typical. Furthermore, for every $i \in [w]$ such that
Ta,i = *, we have that 

and qw - < \{Ta,i)-H*)\ < qw + (27) 

by the first property of r being typical. Writing Sa^i for (a subset of [tc]) and Sa for 

(ra)“^(*) (a subset of [tc]), it follows from the second branch of (12) and Definition 7 that every 
i G Sa satisfies 


Pr , [Pa,i = *] = qa,i = 

P*r-n{T) 


(1 - - A 


t 


Since Sa,i is (fc + l)-acceptable, by the A: + 1 case of Lemma 10.5 we have that 

g,,, = g-(l±2tfe+in;^("+i’")). 


Since |S'a| < u), we have 


E. , [l(Pa) H*)\] = Y1 ^ 


• q{l + 24+iu>^(''+^’'^^) <qw + 


where the O comes from the fact that wtk+iq = ©(logtc) (recalling Lemma 7.1 we have that 
tk+i = ^ ± o{q)). On the other hand, by (26) and similar reasoning we also have the lower bound 

E [|(pj-'(*)|] >{w- w^/^) • q{l - 24+in;^("+i’'')) > qw - 0(n;^("+i’'')), 

p^7^{r) 

where we have taken advantage of the fact that w^^^q = 0{w^'^) = Since = 

a;(polylog(t(;) • (here is where we are using the fact that d < it follows from 

Fact 5.1 that 


Pr [|(pj“^(*)| T^qw± < exp(-D(n;2/3(Lrf)/gy^)) 

P*r-n{T) 

<exp(-D(rc2/3(M)-|)). □ 

Lemma 10.7. Fix 2 < k < d — 2 and let r G {•, o, be typical. For each a G Ak-i we write 

Sa = ‘S'a(r) to denote (7)i)~^(*) (note that this is a subset of [w]). Then for p <— TI{t), we have 
that Pa (which is a string in {•, o, satisfies: 

{ Pa = {o}"' with probability ^ 

(Pa)~^(*) / 0 with probability 1 — (1 — A)l'^“l 

Pa £ {O) *}^ \ otherwise, 

independently for all a G A^-i- (Recall that Ta G {*, o}'^ \ {o}"' for all a G A^-i since r is typical.) 
This implies that 

{ • with probability — A — qa^i) 

o with probability 1 — (1 — A) II 
* otherwise 

independently for all a G Ak-i. (Recall that fa = * for all a G Ak-i since r is typical.) 
Proof. The value of j is independent across all a G and i G [tc] such that Ta,i = *• Fix such
a a G Ak-i and i G [w], and recall that 


Ta,G{*,or\{or. 

By (12) and Definition 7 (the definition of the lift operator), we have that 

{ • with probability A 
=t= with probability qa,i 
o otherwise, with probability 1 — A — qa,i. 

The lemma then follows by independence. 


□ 


Remark 15. If r G {•, o, is typical then (recall that Sa = {%) ^(*) is a subset of [re] and 

Sa,i = ('ra,i)~^(*) is a subset of [tc]) we have 

\Sa\ >w- and qw - < qw + for all i G Sa- 

Therefore we have the estimates 

Pr pa = •] = n ^ ^ < (1 - 

i&Sa 

where we have used Lemma 10.5 for the second inequality, and 


Pr 


Pa = ° 


= 1 - (1 - A)l^“l < 1 - (1 - A)’^ < 1 - (1 - Aw) = Aw. 


Lemma 10.8 (Condition (2) of typicality). For 2 < k < d — 2 let t £ {•, o, he typical and 

fix a G Ak- 2 - Then 

Pr [\{fia)~\*)\>Wk- 2 -w^y = l-e-^^^\ 

P-^n{T) 


Proof. By Lemma 10.7 and the two estimates of Remark 15, each coordinate of (p)a is indepen¬ 
dently in {•, o} with probability at most -|- Aw = O ( ^ ) • Hence the expected size of 

\{pa) ^({•!°})| is 0(w^/^), and we may apply Fact 5.1 to get that 

pj-,, [|(?»)'‘({*.°})l >*»*'''■] < 

with room to spare. □ 


10.2 Sipser survives random projections 

In this subsection we prove the main results of Section 10; these are two results which show, in different ways, that the $\mathrm{Sipser}_d$ function “retains structure” after being hit with the random projection $\mathbf{\Psi}$. The first of these results, Proposition 10.11, gives a useful characterization of $\mathbf{\Psi}(\mathrm{Sipser}_d)$ by showing that it is distributed identically to a (suitably randomly restricted) depth-one formula. The second of these results, Proposition 10.13, shows that this randomly restricted depth-one formula is very close to perfectly balanced in expectation. Our later arguments will use both these types of structure.
10.2.1 $\mathrm{Sipser}_d$ reduces under $\mathbf{\Psi}$ to a random restriction of $\mathrm{Sipser}_d^{(1)}$

Recalling the definitions of the depth-$k$ $\mathrm{Sipser}_d^{(k)}$ formulas from Definition 5, we begin with the following observation regarding the effect of projections on the $\mathrm{Sipser}_d^{(k)}$ formulas:

Fact 10.9. For $2 \le k \le d$ we have that

$$\mathrm{proj}\ \mathrm{Sipser}_d^{(k)} \equiv \mathrm{Sipser}_d^{(k-1)}.$$

In words, Fact 10.9 says that the projection operator “wipes out” the bottom-layer gates of $\mathrm{Sipser}_d^{(k)}$, reducing its depth by exactly one. Fact 10.9 is a straightforward consequence of the definitions of projections and the $\mathrm{Sipser}_d^{(k)}$ formulas (Definitions 4 and 5 respectively), but is perhaps most easily seen to be true via the equivalent view of projections described in Remark 9: for every bottom-layer gate $a \in A_{k-1}$ of $\mathrm{Sipser}_d^{(k)}$, the projection operator simply replaces every one of its $w_{k-1}$ formal input variables $x_{a,1}, \ldots, x_{a,w_{k-1}}$ with the same fresh formal variable $y_a$. Since $\mathrm{AND}(y_a, \ldots, y_a) \equiv \mathrm{OR}(y_a, \ldots, y_a) \equiv y_a$, the gate simplifies to the single variable $y_a$. (Indeed, we defined our projection operators precisely so that they sync up with $\mathrm{Sipser}_d$ this way.)
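The “fresh variable per bottom gate” view of projections can be checked by brute force on a toy formula. The sketch below builds a read-once formula with alternating OR/AND layers (a simplified stand-in for the $\mathrm{Sipser}_d^{(k)}$ formulas: it uses the same fan-in at every level, which is an assumption made for brevity), feeds every input of bottom gate $a$ the same variable $y_a$, and compares the result with the formula of depth one less. All function names are illustrative.

```python
from itertools import product

def sipser_like(depth, fanin, x, addr=(), top_is_or=True):
    """Evaluate a read-once alternating formula; the leaf at address addr reads x[addr]."""
    if len(addr) == depth:
        return x[addr]
    level_is_or = top_is_or if len(addr) % 2 == 0 else not top_is_or
    kids = [sipser_like(depth, fanin, x, addr + (i,), top_is_or) for i in range(fanin)]
    return int(any(kids)) if level_is_or else int(all(kids))

def projected(depth, fanin, y, top_is_or=True):
    """Projection: every input x[a + (i,)] of bottom gate a is replaced by the variable y[a]."""
    x = {a + (i,): y[a] for a in y for i in range(fanin)}
    return sipser_like(depth, fanin, x, top_is_or=top_is_or)

def check(depth, fanin, top_is_or=True):
    """Compare the projected depth-`depth` formula with the depth-(depth-1) formula on all inputs."""
    addrs = list(product(range(fanin), repeat=depth - 1))
    for bits in product((0, 1), repeat=len(addrs)):
        y = dict(zip(addrs, bits))
        lhs = projected(depth, fanin, y, top_is_or)
        rhs = sipser_like(depth - 1, fanin, y, top_is_or=top_is_or)
        if lhs != rhs:
            return False
    return True

print(check(3, 2), check(3, 3, top_is_or=False))
```

The check succeeds regardless of whether the bottom layer consists of AND or OR gates, since both collapse to the identity when all of their inputs are the same variable.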

The same reasoning, along with the definition of lifts (see Definition 7 and the discussion after), 
yields the following extension of Fact 10.9: 

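Fact 10.10. For $2 \le k \le d$ and $\rho \in \{0,1,*\}^{A_k}$ we have

$$\mathrm{proj}_\rho\, \mathrm{Sipser}_d^{(k)} \equiv \mathrm{Sipser}_d^{(k-1)} \upharpoonright \hat{\rho}.$$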

Remark 16. With Fact 10.10 in hand we now revisit our definition of typical restrictions (recall 
Definition 14 and the discussion thereafter). Recall that the high-level rationale behind this defini¬ 
tion is that for p such that p is typical, the projection proj^ has a “very limited and well-controlled 
effect” on the target Sipser^. We now make this statement more precise (the reader may find it 
helpful to refer to the illustration in Figure 2). 

Fix p £ supp(77.init) such that p is typical. By Fact 10.10, we have that 

proj^Sipser^ = Sipser^"*"^^ ( p. 

Since p is typical, 

- The first condition of Definition 14 implies that |(pa)~^(*)| = Qigw) = Q{^/w) for all a £ 
AcI- 2 - Each such a £ Aci -2 is the address of an OR gate, and so if (pa)~^(l) 7^ 0 the gate 
is satisfied and evaluates to 1, and otherwise if pa £ the value of the gate remains 

undetermined (i.e. it “evaluates to *”) and its fan-in becomes |(pa)~^(*)| = Q{^/w). 

~ The second condition of Definition 14 tells us that between the two possibilities above, the 
latter is far more common: for every a £ A^-s specifying a block of Wci -3 many OR gates, at 
most of these gates evaluate to 1 and the remaining (vast majority) are undetermined. 
Equivalently, all the AND gates at level d — 3 remain undetermined, and they all have fan-in 
at least Wds — = Wd -3 (1 — o(l)). 

The same description holds for projp{fc) and Sipser^^\ For pill's that are typical the projection 
operator projp(fc): 
- “wipes out” the bottom-level (level-$k$) gates of $\mathrm{Sipser}_d^{(k)}$,

- “trims” the fan-ins of the level-$(k-1)$ gates from $w$ to $\tilde{\Theta}(\sqrt{w})$,

- keeps the fan-ins of all level-(A: — 2) gates at least Wk -2 — (1 ~ o(l))- 

Note in particular that the entire structure of the formula from levels 0 through A: — 3 is identical 
to that of Sipser^^\ and so projp(fe) Sipser^^^ “contains a perfect copy of” Sipser^^ 

Repeated application of Fact 10.10 gives us the following proposition. (The proposition is intuitively very useful since it tells us that in order to understand the effect of the random projection $\mathbf{\Psi}$ on the (relatively complicated) $\mathrm{Sipser}_d$ function, it suffices to analyze the effect of the random restriction $\hat{\boldsymbol{\rho}}^{(2)}$ on the (much simpler) $\mathrm{Sipser}_d^{(1)}$ function; we will apply it in the final proof of each of our main lower bounds.)

Proposition 10.11. Consider $\mathrm{Sipser}_d : \{0,1\}^n \to \{0,1\}$. Then

$$\mathbf{\Psi}(\mathrm{Sipser}_d) \equiv \mathrm{Sipser}_d^{(1)} \upharpoonright \hat{\boldsymbol{\rho}}^{(2)}.$$


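Proof. By Fact 10.10 we have that

$$\mathrm{proj}_{\rho^{(d)}}\, \mathrm{Sipser}_d \equiv \mathrm{Sipser}_d^{(d-1)} \upharpoonright \hat{\rho}^{(d)} \qquad (28)$$

for all $\rho^{(d)} \in \mathrm{supp}(\mathcal{R}_{\mathrm{init}}) = \{0,1,*\}^n$. Furthermore for $\rho^{(k+1)} \in \{0,1,*\}^{A_{k+1}}$ and $\rho^{(k)} \in \mathrm{supp}(\mathcal{R}(\hat{\rho}^{(k+1)})) \subseteq \{0,1,*\}^{A_k}$ we have

$$\mathrm{proj}_{\rho^{(k)}}\left(\mathrm{Sipser}_d^{(k)} \upharpoonright \hat{\rho}^{(k+1)}\right) \equiv \mathrm{proj}\left(\left(\mathrm{Sipser}_d^{(k)} \upharpoonright \hat{\rho}^{(k+1)}\right) \upharpoonright \rho^{(k)}\right) \equiv \mathrm{proj}\left(\mathrm{Sipser}_d^{(k)} \upharpoonright \rho^{(k)}\right) \equiv \mathrm{Sipser}_d^{(k-1)} \upharpoonright \hat{\rho}^{(k)}, \qquad (29)$$

where the first equivalence is by the definition of $\rho$-projection (Definition 4), the second is by the fact that $\mathcal{R}(\hat{\rho}^{(k+1)})$ is supported on refinements of $\hat{\rho}^{(k+1)}$ (and in particular, $\rho^{(k)}$ refines $\hat{\rho}^{(k+1)}$), and the last is Fact 10.10. The proposition follows from (28), repeated application of (29), and the definition of $\mathbf{\Psi}$ (Definition 10). □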

10.2.2 $\mathrm{Sipser}_d$ remains unbiased after random projection by $\mathbf{\Psi}$

Recall that $\mathrm{Sipser}_d^{(1)}$ denotes the function computed by the top gate of $\mathrm{Sipser}_d$, and in particular, $\mathrm{Sipser}_d^{(1)}$ is a $w_0$-way OR if $d$ is even, and a $w_0$-way AND if $d$ is odd (cf. Definition 5). In this subsubsection we will assume that $d$ is even; the argument for odd values of $d$ follows via a symmetric argument.

To obtain our ultimate results we will need a lower bound on the bias of $\mathbf{\Psi}(\mathrm{Sipser}_d)$ under $\mathbf{Y}$ (or equivalently, by the preceding proposition, on the bias of $\mathrm{Sipser}_d^{(1)} \upharpoonright \hat{\boldsymbol{\rho}}^{(2)}$ where $\boldsymbol{\rho}^{(2)}$ is distributed as described in Definition 10). The following lemma will help us establish such a lower bound:
Lemma 10.12. Let r G {0,1, be typical. Then for p ^ TZ{t) and Y we have


E 

p 


1 


bias(Sipser^ fp,Y) >- — 0{w 


Proof. By our assumption that d is even we may write in place of Sipser^^^ Since r is typical, 

we have by Conditions (2) and (3) of Definition 14 that 

fG{0,=t:}"'° and |(r)“^(=t:)| > mo — 

Furthermore, by (12) of Definition 9 and Definition 7 (the definition of the lift operator), we have 
that 

{ 1 with probability A 

* with probability qi (30) 

0 otherwise, with probability 1 — X — Qi 

independently for all i G (^)~^(*) C [mo], where 


di = 


(1 - A 


and Si = Si{T) = ^(=t:) = {j G [mi] : = *} satisfies \Si\ = qw ± By a calculation very 

similar to the one that was employed in the proof of Lemma 10.6, we have that 

Pr [l(p)~^(*)l = qwo ± > 1 — (31) 

Furthermore, (30) also implies that 

Pr[p G {0, *}’"°] = (1 - A)l^"'(*)l > (1 - X)^° > 1 - Amo = 1 - ©(m-^/^). (32) 


Fix any p G supp(7?.(r)) that satishes the events of both (31) and (32). Writing S{p) C [mo] to 
denote the set (p)~^(*), we have the bounds 

Pr[(OR^„ [p)(Y) = 0] = (l-ti)l^(^)l 


> (1-ti) 


1 




log m 


m 


(l-ii) 


„a(l,d) 




>^-6(u>w.2), 

where the second inequality crucially uses the definition (4) of wq and its corollary (9). Similarly, 

Pr[(OR^„ [p)(Y) = 0] = (l-ti)l^(?)l 

< (^1 _ 

< h (1 - ti)-'"’"'"' 


which establishes the lemma. 


□ 
Now we are ready to lower bound the expected bias of $\mathbf{\Psi}(\mathrm{Sipser}_d)$ (or equivalently, of $\mathrm{Sipser}_d^{(1)} \upharpoonright \hat{\boldsymbol{\rho}}^{(2)}$) under $\mathbf{Y}$:

Proposition 10.13. For ^ as defined in Definition 10, 

^(/) = projp(2) projpO) • • -projpCd-i) projp(d) / 


where •«— T^mit and for all 2 < k < d — 1, and for Y ^ 

have that 

bias(Sipser^^^ f pl^l^Y) > ^ 


we 


E 


Proof. By Proposition 10.1 and d — 3 successive applications of Proposition 10.2, we have that 

Pr ..., p(^) are all typical] > 1 — d • 

For every typical G {0,1, Lemma 10.12 gives that 


E _ 


bias(SipseA^^ ( pC^fY) > — 0{w 


which together with the preceding inequality gives the proposition. 


□ 


Remark 17. We note that combining Proposition 10.11 and Proposition 10.13, for Y {0i_i^, lti}^° 
we have that 

E [bias(’J'(Sipserrf), Y)] > 


which we may rewrite as 


Pr[('J^(Sipser,))(Y) = 0]=E 


Pr(^(Sipserrf))(Y) = 0] =^± d{w 


Applying Proposition 8.1, we get that for X ^ {O 1/21 11 / 2 }"' have 

Pr[Sipserrf(X) = 1] = 1 ± 


verifying (6) in Section 6: the Sipser^ function is indeed (essentially) balanced. 


11 Proofs of main theorems 

Recall that $\mathrm{Sipser}_d^{(1)}$ denotes the function computed by the top gate of $\mathrm{Sipser}_d$, and in particular, $\mathrm{Sipser}_d^{(1)}$ is a $w_0$-way OR if $d$ is even, and a $w_0$-way AND if $d$ is odd (cf. Definition 5). Throughout this section we will assume that $d$ is even; the argument for odd values of $d$ follows via a symmetric argument. For conciseness we will sometimes write $\mathrm{OR}_{w_0}$ in place of $\mathrm{Sipser}_d^{(1)}$ in the arguments below; we stress that these are the same function.
11.1 “Bottoming out” the argument

As we will see in the proofs of Theorems 6 and 7, the machinery we have developed enables us to relate the correlation between $\mathrm{Sipser}_d$ and the circuits $C$ against which we are proving lower bounds, to the correlation between $\mathrm{Sipser}_d^{(1)} \upharpoonright \hat{\boldsymbol{\rho}}^{(2)}$ (obtained by hitting $\mathrm{Sipser}_d$ with the random projection $\mathbf{\Psi}$) and bounded-width CNFs (that are similarly obtained by hitting $C$ with $\mathbf{\Psi}$). To finish the argument, we need to bound the correlation between $\mathrm{Sipser}_d^{(1)} \upharpoonright \tau$ (for suitable restrictions $\tau$) and such CNFs. The following proposition, which is a slight extension of Lemma 4.1 of [OW07], enables us to do this, by relating the correlation between $\mathrm{Sipser}_d^{(1)} \upharpoonright \tau$ and such CNFs to the bias of $\mathrm{Sipser}_d^{(1)} \upharpoonright \tau$.

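Proposition 11.1. Let $F : \{0,1\}^{w_0} \to \{0,1\}$ be a width-$r$ CNF and $\tau \in \{0,*\}^{w_0} \setminus \{0\}^{w_0}$. Then

$$\Pr[(\mathrm{OR}_{w_0} \upharpoonright \tau)(\mathbf{Y}) \neq F(\mathbf{Y})] \ge \mathrm{bias}(\mathrm{OR}_{w_0} \upharpoonright \tau, \mathbf{Y}) - r t_1.$$

Proof. Writing $S = S(\tau) \subseteq [w_0]$ to denote the set $\tau^{-1}(*)$, we have that $\mathrm{OR}_{w_0} \upharpoonright \tau$ computes the $|S|$-way OR of variables with indices in $S$ (note that $S \neq \emptyset$ since $\tau \in \{0,*\}^{w_0} \setminus \{0\}^{w_0}$); for notational brevity we will write $\mathrm{OR}_S$ instead of $\mathrm{OR}_{w_0} \upharpoonright \tau$.

We begin with the claim that there exists a CNF $F' : \{0,1\}^{w_0} \to \{0,1\}$ of size and width at most that of $F$, depending only on the variables in $S$, such that

$$\Pr[\mathrm{OR}_S(\mathbf{Y}) \neq F(\mathbf{Y})] \ge \Pr[\mathrm{OR}_S(\mathbf{Y}) \neq F'(\mathbf{Y})]. \qquad (33)$$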

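This holds because

$$\Pr[\mathrm{OR}_S(\mathbf{Y}) \neq F(\mathbf{Y})] = \mathop{\mathbf{E}}_{\boldsymbol{\rho}}\Big[\Pr[(\mathrm{OR}_S \upharpoonright \boldsymbol{\rho})(\mathbf{Y}) \neq (F \upharpoonright \boldsymbol{\rho})(\mathbf{Y})]\Big] = \mathop{\mathbf{E}}_{\boldsymbol{\rho}}\Big[\Pr[\mathrm{OR}_S(\mathbf{Y}) \neq (F \upharpoonright \boldsymbol{\rho})(\mathbf{Y})]\Big],$$

where $\boldsymbol{\rho}$ ranges over random assignments to the variables outside $S$ (so that $\mathrm{OR}_S \upharpoonright \boldsymbol{\rho} \equiv \mathrm{OR}_S$), and so certainly there exists such a $\rho$ for which $F' := F \upharpoonright \rho$ satisfies (33). Next, writing $\{y_i\}_{i \in S}$ to denote the formal variables that both $\mathrm{OR}_S$ and $F'$ depend on, we consider two possible cases: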

1. For every clause $T$ in $F'$ there exists $i \in S$ such that $\neg y_i$ occurs in $T$. In this case we note that $F'(0^S) = 1$ (whereas $\mathrm{OR}_S(0^S) = 0$), and so

$$\Pr[\mathrm{OR}_S(\mathbf{Y}) \neq F'(\mathbf{Y})] \ge \Pr[\mathbf{Y}_i = 0 \text{ for all } i \in S] = \Pr[\mathrm{OR}_S(\mathbf{Y}) = 0].$$

2. Otherwise, there must exist a monotone clause $T$ in $F'$ (one containing only positive occurrences of variables) since $F'$ depends only on the variables in $S$. In this case, since each unnegated literal is true with probability $t_1$ (recall that $\mathbf{Y} \leftarrow \{0_{1-t_1}, 1_{t_1}\}^{w_0}$) and $T$ has width at most $r$, by a union bound we have that

$$\Pr[F'(\mathbf{Y}) = 1] \le \Pr[T(\mathbf{Y}) = 1] \le r t_1,$$

and so

$$\Pr[\mathrm{OR}_S(\mathbf{Y}) \neq F'(\mathbf{Y})] \ge \Pr[\mathrm{OR}_S(\mathbf{Y}) = 1] - \Pr[F'(\mathbf{Y}) = 1] \ge \Pr[\mathrm{OR}_S(\mathbf{Y}) = 1] - r t_1.$$
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Together, theses two cases give us the lower bound 


Pr[0R5(Y) ^ F'{Y)] > min { Pr[0R5(Y) = 1], Pr[0R5(Y) = 0] - rh} 
> min{Pr[ORs(Y) = l],Pr[0R5(Y) = 0]} - rti, 

which along with (33) completes the proof. 


□ 
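The inequality of Proposition 11.1 can be sanity-checked by brute force on small instances. The following is a minimal sketch, not part of the proof: the helper names (`check_proposition`, `random_cnf`) and the toy parameters are illustrative assumptions, a literal `(i, b)` is taken to be satisfied when `y[i] == b`, and $\mathrm{bias}(f, \mathbf{Y})$ is taken to be $\min\{\Pr[f(\mathbf{Y}) = 0], \Pr[f(\mathbf{Y}) = 1]\}$, as in the last display of the proof.

```python
from itertools import product
import random

def weight(y, t1):
    """Probability of the point y when each bit is 1 independently with probability t1."""
    ones = sum(y)
    return t1 ** ones * (1 - t1) ** (len(y) - ones)

def eval_cnf(cnf, y):
    """A CNF is a list of clauses; a clause is a list of literals (i, b),
    and the literal (i, b) is satisfied when y[i] == b."""
    return int(all(any(y[i] == b for i, b in clause) for clause in cnf))

def or_restricted(S, y):
    """OR_{w0} restricted by tau: the OR of the coordinates in S = tau^{-1}(*)."""
    return int(any(y[i] for i in S))

def check_proposition(w0, S, cnf, t1):
    """Return (Pr[OR_S(Y) != F(Y)], bias(OR_S, Y) - r * t1) by full enumeration."""
    r = max(len(clause) for clause in cnf)  # width of the CNF
    disagree = p_one = 0.0
    for y in product((0, 1), repeat=w0):
        pr = weight(y, t1)
        g = or_restricted(S, y)
        p_one += pr * g
        disagree += pr * (g != eval_cnf(cnf, y))
    bias = min(p_one, 1 - p_one)
    return disagree, bias - r * t1

def random_cnf(w0, r, n_clauses, rng):
    """Hypothetical helper: a random CNF with clauses of width between 1 and r."""
    cnf = []
    for _ in range(n_clauses):
        idx = rng.sample(range(w0), rng.randint(1, r))
        cnf.append([(i, rng.randint(0, 1)) for i in idx])
    return cnf
```

For instance, with `w0 = 3`, `S = [0, 2]`, `t1 = 0.1` and the single clause `[(0, 1), (2, 1)]`, the CNF coincides with $\mathrm{OR}_S$, so the left-hand side is 0 while the right-hand side is roughly $0.19 - 0.2 = -0.01$, which shows that the $r t_1$ loss cannot be dropped entirely.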


11.2 Approximators with small bottom fan-in 


The pieces are in place to prove the first of our two main theorems, showing that $\mathsf{Sipser}_d$ cannot be
approximated by depth-$d$ size-$S$ circuits with bounded bottom fan-in:

Theorem 6. For $2 \le d \le$ , the $n$-variable $\mathsf{Sipser}_d$ function has the following property: Let
$C : \{0,1\}^n \to \{0,1\}$ be any depth-$d$ circuit of size $S =$ and bottom fan-in . Then
for a uniform random input $\mathbf{X} \leftarrow \{0_{1/2}, 1_{1/2}\}^n$, we have

$\Pr[\mathsf{Sipser}_d(\mathbf{X}) \ne C(\mathbf{X})] \ge \frac{1}{2} -$

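Proof. Let $\mathbf{Y} \leftarrow \{0_{1-t_1}, 1_{t_1}\}^{w_0}$. We successively apply Proposition 8.1 and Proposition 10.11 to
obtain

$$\begin{aligned}
\Pr[\mathsf{Sipser}_d(\mathbf{X}) \ne C(\mathbf{X})]
&= \mathbb{E}_{\boldsymbol{\Psi}}\big[\Pr[(\boldsymbol{\Psi}(\mathsf{Sipser}_d))(\mathbf{Y}) \ne (\boldsymbol{\Psi}(C))(\mathbf{Y})]\big] \\
&= \mathbb{E}_{\boldsymbol{\Psi}}\big[\Pr[(\mathrm{OR}_{w_0} \upharpoonright \widehat{\boldsymbol{\rho}^{(2)}})(\mathbf{Y}) \ne (\boldsymbol{\Psi}(C))(\mathbf{Y})]\big]
\end{aligned}$$

(for the second equality, recall that $\mathsf{Sipser}_d^{(1)}$ is simply $\mathrm{OR}_{w_0}$, by our assumption from the start
of the section that $d$ is even). For every possible outcome $\Psi$ of $\boldsymbol{\Psi}$ (corresponding to successive
outcomes $\rho^{(d)}$ for $\boldsymbol{\rho}^{(d)}$, ..., $\rho^{(2)}$ for $\boldsymbol{\rho}^{(2)}$) and every $r \in \mathbb{N}$, we have the bound

$$\begin{aligned}
&\Pr[(\mathrm{OR}_{w_0} \upharpoonright \widehat{\rho^{(2)}})(\mathbf{Y}) \ne (\Psi(C))(\mathbf{Y})] \\
&\quad\ge \Pr[(\mathrm{OR}_{w_0} \upharpoonright \widehat{\rho^{(2)}})(\mathbf{Y}) \ne (\Psi(C))(\mathbf{Y}) \mid \Psi(C) \text{ is a depth-}r\text{ DT}] - \mathbf{1}[\Psi(C) \text{ is not a depth-}r\text{ DT}] \\
&\quad\ge \mathrm{bias}(\mathrm{OR}_{w_0} \upharpoonright \widehat{\rho^{(2)}}, \mathbf{Y}) - r t_1 - \mathbf{1}[\Psi(C) \text{ is not a depth-}r\text{ DT}],
\end{aligned}$$

where the final inequality is by Proposition 11.1 along with the fact that every depth-$r$ DT can be
expressed as either a width-$r$ CNF or a width-$r$ DNF. Setting $r$ appropriately and taking expectation
with respect to $\boldsymbol{\Psi}$, we conclude that

$$\mathbb{E}_{\boldsymbol{\Psi}}\big[\Pr[(\mathrm{OR}_{w_0} \upharpoonright \widehat{\boldsymbol{\rho}^{(2)}})(\mathbf{Y}) \ne (\boldsymbol{\Psi}(C))(\mathbf{Y})]\big]
\ge \mathbb{E}_{\boldsymbol{\Psi}}\big[\mathrm{bias}(\mathrm{OR}_{w_0} \upharpoonright \widehat{\boldsymbol{\rho}^{(2)}}, \mathbf{Y})\big] - r t_1 - \Pr_{\boldsymbol{\Psi}}[\boldsymbol{\Psi}(C) \text{ is not a depth-}r\text{ DT}].$$

Proposition 10.13 bounds the expected bias from below (by $\frac{1}{2}$ minus a term that vanishes as $w$ grows), and
Theorem 13 bounds the probability that $\boldsymbol{\Psi}(C)$ is not a depth-$r$ DT from above. The theorem then
follows by simple substitution, recalling the values of $r$, $t_1$ and $w$ in terms of $n$ and $d$. □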
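The fact used above, that a depth-$r$ decision tree can be written both as a width-$r$ DNF and as a width-$r$ CNF, can be made concrete as follows. This is a minimal sketch under an assumed encoding: a tree is either a leaf (`0` or `1`) or a triple `(variable, subtree_if_0, subtree_if_1)`, and a literal `(v, b)` is satisfied when `y[v] == b`. Each term of the DNF records a root-to-leaf path ending in 1; each clause of the CNF is the negation of a path ending in 0.

```python
from itertools import product

def paths(tree, prefix=()):
    """Yield (path, leaf) pairs; a path is a tuple of (variable, value) tests."""
    if isinstance(tree, int):
        yield prefix, tree
    else:
        var, low, high = tree
        yield from paths(low, prefix + ((var, 0),))
        yield from paths(high, prefix + ((var, 1),))

def to_dnf(tree):
    """One term per 1-leaf: the conjunction of the tests along its path."""
    return [list(p) for p, leaf in paths(tree) if leaf == 1]

def to_cnf(tree):
    """One clause per 0-leaf: falsified exactly by inputs that follow its path."""
    return [[(v, 1 - b) for v, b in p] for p, leaf in paths(tree) if leaf == 0]

def eval_tree(tree, y):
    while not isinstance(tree, int):
        var, low, high = tree
        tree = high if y[var] else low
    return tree

def eval_dnf(dnf, y):
    return int(any(all(y[v] == b for v, b in term) for term in dnf))

def eval_cnf(cnf, y):
    return int(all(any(y[v] == b for v, b in clause) for clause in cnf))
```

Since every path has length at most the depth of the tree, every term and every clause produced this way has width at most $r$, which is what allows Proposition 11.1 to be applied in the CNF case.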
11.3 Approximators with the opposite alternation pattern

Our second main theorem states that $\mathsf{Sipser}_d$ cannot be approximated by depth-$d$ size-$S$ circuits
with the opposite alternation pattern to $\mathsf{Sipser}_d$:

Theorem 7. For $2 \le d \le$ , the $n$-variable $\mathsf{Sipser}_d$ function has the following property: Let
$C : \{0,1\}^n \to \{0,1\}$ be any depth-$d$ circuit of size $S =$ and the opposite alternation pattern
to $\mathsf{Sipser}_d$ (i.e. its top-level gate is OR if $\mathsf{Sipser}_d$'s is AND and vice versa). Then for a uniform
random input $\mathbf{X} \leftarrow \{0_{1/2}, 1_{1/2}\}^n$, we have

$\Pr[\mathsf{Sipser}_d(\mathbf{X}) \ne C(\mathbf{X})] \ge \frac{1}{2} -$

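Proof. By our assumption that $d$ is even, we have that the top gate of $\mathsf{Sipser}_d$ is a $w_0$-way OR,
whereas the top gate of $C$ is an AND. Let $\mathbf{Y} \leftarrow \{0_{1-t_1}, 1_{t_1}\}^{w_0}$. As in the proof of Theorem 6, we
successively apply Proposition 8.1 and Proposition 10.11 to obtain

$$\begin{aligned}
\Pr[\mathsf{Sipser}_d(\mathbf{X}) \ne C(\mathbf{X})]
&= \mathbb{E}_{\boldsymbol{\Psi}}\big[\Pr[(\boldsymbol{\Psi}(\mathsf{Sipser}_d))(\mathbf{Y}) \ne (\boldsymbol{\Psi}(C))(\mathbf{Y})]\big] \\
&= \mathbb{E}_{\boldsymbol{\Psi}}\big[\Pr[(\mathrm{OR}_{w_0} \upharpoonright \widehat{\boldsymbol{\rho}^{(2)}})(\mathbf{Y}) \ne (\boldsymbol{\Psi}(C))(\mathbf{Y})]\big].
\end{aligned}$$

For every possible outcome $\Psi = \rho^{(d)}, \ldots, \rho^{(2)}$ of $\boldsymbol{\Psi}$ and every $r \in \mathbb{N}$ we have the bound

$$\begin{aligned}
&\Pr[(\mathrm{OR}_{w_0} \upharpoonright \widehat{\rho^{(2)}})(\mathbf{Y}) \ne (\Psi(C))(\mathbf{Y})] \\
&\quad\ge \Pr[(\mathrm{OR}_{w_0} \upharpoonright \widehat{\rho^{(2)}})(\mathbf{Y}) \ne (\Psi(C))(\mathbf{Y}) \mid \Psi(C) \text{ is } (1/S)\text{-close to a width-}r\text{ CNF}] \\
&\qquad - \mathbf{1}[\Psi(C) \text{ is not } (1/S)\text{-close to a width-}r\text{ CNF}] \\
&\quad\ge \mathrm{bias}(\mathrm{OR}_{w_0} \upharpoonright \widehat{\rho^{(2)}}, \mathbf{Y}) - r t_1 - (1/S) - \mathbf{1}[\Psi(C) \text{ is not } (1/S)\text{-close to a width-}r\text{ CNF}],
\end{aligned}$$

where the final inequality is by Proposition 11.1 (together with the triangle inequality, which accounts
for the $1/S$ term). As in the proof of Theorem 6, setting $r$ appropriately
and taking expectation with respect to $\boldsymbol{\Psi}$, we conclude that

$$\begin{aligned}
&\mathbb{E}_{\boldsymbol{\Psi}}\big[\Pr[(\mathrm{OR}_{w_0} \upharpoonright \widehat{\boldsymbol{\rho}^{(2)}})(\mathbf{Y}) \ne (\boldsymbol{\Psi}(C))(\mathbf{Y})]\big] \\
&\quad\ge \mathbb{E}_{\boldsymbol{\Psi}}\big[\mathrm{bias}(\mathrm{OR}_{w_0} \upharpoonright \widehat{\boldsymbol{\rho}^{(2)}}, \mathbf{Y})\big] - (1/S) - r t_1 - \Pr_{\boldsymbol{\Psi}}[\boldsymbol{\Psi}(C) \text{ is not } (1/S)\text{-close to a width-}r\text{ CNF}].
\end{aligned}$$

Proposition 10.13 bounds the expected bias from below, and Theorem 14 bounds the probability that
$\boldsymbol{\Psi}(C)$ is not $(1/S)$-close to a width-$r$ CNF from above. The theorem then
follows by simple substitution, recalling the values of $r$, $t_1$, $w$ and $S$ in terms of $n$ and $d$. □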
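The $1/S$ loss in the proof above comes from replacing $\Psi(C)$ by a nearby CNF. A minimal numerical sketch of this step follows; it assumes that "$\varepsilon$-close" means disagreeing with probability at most $\varepsilon$ under the distribution of $\mathbf{Y}$, represents functions as truth tables (dictionaries keyed by input tuples), and uses the hypothetical helper names `dist` and `random_table`. It checks that for any three functions $f, g, h$ we have $\Pr[f \ne g] \ge \Pr[f \ne h] - \Pr[g \ne h]$, so that if $g$ is $(1/S)$-close to $h$ then a lower bound on $\Pr[f \ne h]$ transfers to $g$ with a loss of $1/S$.

```python
from itertools import product
import random

def weight(y, t1):
    """Probability of the point y when each bit is 1 independently with probability t1."""
    ones = sum(y)
    return t1 ** ones * (1 - t1) ** (len(y) - ones)

def dist(f, g, w0, t1):
    """Pr[f(Y) != g(Y)] for truth tables f and g over {0,1}^w0."""
    return sum(weight(y, t1) for y in product((0, 1), repeat=w0) if f[y] != g[y])

def random_table(w0, rng):
    """Hypothetical helper: a uniformly random truth table on w0 bits."""
    return {y: rng.randint(0, 1) for y in product((0, 1), repeat=w0)}
```

In the proof, $f$ plays the role of $\mathrm{OR}_{w_0} \upharpoonright \widehat{\rho^{(2)}}$, $g$ the role of $\Psi(C)$, and $h$ the role of the width-$r$ CNF to which Proposition 11.1 is applied.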
A Proof of Lemma 7.1

Lemma 7.1. There is a universal constant $c > 0$ such that for $2 \le d \le$ we have that
$t_k = q \pm q^{1.1}$ for all $k \in [d-1]$.

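Proof. We shall establish the following bound, for $k = d-1, \ldots, 1$, by downward induction on $k$:

$$|t_k q - p| \le (2m)^{d-k-1}\lambda. \tag{34}$$

Lemma 7.1 follows directly from (34), using (7), (3) and the fact that $p = \Theta\big(\frac{\log w}{w}\big)$.

The base case $k = d-1$ of (34) holds with equality since (8) gives us that $|t_{d-1}q - p| = \lambda$.

For the inductive step suppose that (34) holds for some value $k = \ell + 1$. By (8) we have that
$t_\ell q = (1 - t_{\ell+1})^{qw} - \lambda$, so our goal is to put upper and lower bounds on $(1 - t_{\ell+1})^{qw} - \lambda$ that are
close to $p$. For the upper bound, we have

$$\begin{aligned}
(1 - t_{\ell+1})^{qw} - \lambda
&= \Big((1 - t_{\ell+1})^{1/t_{\ell+1}}\Big)^{qw t_{\ell+1}} - \lambda \\
&\le \exp(-qw t_{\ell+1}) - \lambda && \text{(by Fact 5.3)} \\
&\le \exp\Big(-w\big(p - (2m)^{d-\ell-2}\lambda\big)\Big) - \lambda && \text{(by the inductive hypothesis)} \\
&\le \exp\Big(-\Big(\frac{m2^m}{\log e} - 1\Big) \cdot \big(2^{-m} - (2m)^{d-\ell-2}\lambda\big)\Big) - \lambda && \text{(by (3))} \\
&= 2^{-m} \cdot \exp\Big(2^{-m} + \Big(\frac{m2^m}{\log e} - 1\Big) \cdot (2m)^{d-\ell-2}\lambda\Big) - \lambda \\
&\le p \cdot \Big(1 + 2^{-m+1} + \frac{m2^{m+1}}{\log e}(2m)^{d-\ell-2}\lambda\Big) - \lambda && \text{(by Fact 5.3)} \\
&\le p + 2^{-2m+1} + \frac{2m}{\log e}(2m)^{d-\ell-2}\lambda - \lambda \\
&\le p + (2m)^{d-\ell-1}\lambda,
\end{aligned}$$

where in the last inequality we have used the fact that $\lambda = \omega(2^{-2m})$.
For the lower bound we proceed similarly:

$$\begin{aligned}
(1 - t_{\ell+1})^{qw} - \lambda
&= \Big((1 - t_{\ell+1})^{1/t_{\ell+1}}\Big)^{qw t_{\ell+1}} - \lambda \\
&\ge \exp(-qw t_{\ell+1}) \cdot \big(1 - qw(t_{\ell+1})^2\big) - \lambda && \text{(by Fact 5.3)} \\
&\ge \exp\Big(-w\big(p + (2m)^{d-\ell-2}\lambda\big)\Big) \cdot \big(1 - qw(t_{\ell+1})^2\big) - \lambda && \text{(by the i.h. \& Fact 5.2)} \\
&\ge \exp\Big(-\frac{m2^m}{\log e}\big(2^{-m} + (2m)^{d-\ell-2}\lambda\big)\Big) \cdot \big(1 - qw(t_{\ell+1})^2\big) - \lambda \\
&= 2^{-m} \cdot \exp\Big(-\frac{m2^m}{\log e} \cdot (2m)^{d-\ell-2}\lambda\Big) \cdot \big(1 - qw(t_{\ell+1})^2\big) - \lambda \\
&\ge 2^{-m} \cdot \Big(1 - \frac{m2^m}{\log e} \cdot (2m)^{d-\ell-2}\lambda - qw(t_{\ell+1})^2\Big) - \lambda && \text{(using Fact 5.2)} \\
&\ge 2^{-m} \cdot \Big(1 - \frac{m2^m}{\log e}(2m)^{d-\ell-2}\lambda - \frac{w}{q} \cdot \big(p + (2m)^{d-\ell-2}\lambda\big)^2\Big) - \lambda && \text{(by the i.h.)} \\
&\ge 2^{-m} \cdot \Big(1 - \frac{m2^m}{\log e}(2m)^{d-\ell-2}\lambda - \frac{4wp^2}{q}\Big) - \lambda && \text{(by the bound on } d\text{)} \\
&= p - \frac{m}{\log e}(2m)^{d-\ell-2}\lambda - \frac{4wp^3}{q} - \lambda \\
&\ge p - (2m)^{d-\ell-1}\lambda.
\end{aligned}$$

□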
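The induction behind (34) can also be followed numerically. The sketch below is illustrative only and rests on labeled assumptions: it takes $p = 2^{-m}$ and $w = \lfloor m2^m/\log e \rfloor$ as they are used in the derivation above, iterates $t_\ell q = (1 - t_{\ell+1})^{qw} - \lambda$, and assumes the base case has the sign $t_{d-1}q = p - \lambda$ (the proof only uses $|t_{d-1}q - p| = \lambda$). The concrete values of `q` and `lam` passed in are hypothetical choices made so that the arithmetic stays within double precision; they are not the values fixed by (7).

```python
import math

def t_sequence(m, d, lam, q):
    """Return p and a dict {k: t_k * q} for k = d-1, ..., 1."""
    p = 2.0 ** -m
    w = math.floor(m * 2.0 ** m / math.log2(math.e))
    tq = {d - 1: p - lam}  # assumed base case: t_{d-1} q = p - lambda
    for ell in range(d - 2, 0, -1):
        t_next = tq[ell + 1] / q
        # t_ell q = (1 - t_{ell+1})^{qw} - lambda; log1p keeps the tiny t accurate
        tq[ell] = math.exp(q * w * math.log1p(-t_next)) - lam
    return p, tq

def check_bound(m, d, lam, q, slack=1e-9):
    """Check |t_k q - p| <= (2m)^{d-k-1} * lambda for every k, up to rounding slack."""
    p, tq = t_sequence(m, d, lam, q)
    return all(
        abs(tq[k] - p) <= (2 * m) ** (d - k - 1) * lam * (1 + slack)
        for k in tq
    )
```

With the illustrative choices `m = 60`, `q = 2.0 ** -30` and `lam = 1e-22`, the check passes for `d = 3` and `d = 4`. The deviation $|t_k q - p|$ grows by a factor of at most $2m$ per level, which is why the argument needs an upper bound on $d$.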